The purpose of this research is to develop a new articulatory speech
synthesizer  and  to  identify  its  key  control  parameters  for  producing  high-quality 
nasal  sounds  and  female  voices.  The  proposed  synthesizer  is  a useful  tool  for 
research  involving  both  the  production  and  the  perception  of  speech.  The  techniques 
developed  in  this  research  effort  are  directly  applicable  not  only  to  the  synthesis  of 
natural-sounding  speech,  but  to  the  establishment  of  rules  governing  text-to-speech 
systems  as  well. 

The  proposed  articulatory  speech  synthesizer  is  stable,  flexible  and 
computationally  efficient.  The  trapezoidal  method  of  differential  equation  solving  is 
utilized  to  assure  the  stability  of  the  synthesizer,  since  this  research  effort  proves  that 
the  acoustic  equations  of  the  vocal  system  are  essentially  stiff  differential  equations. 
By  using  the  associated  discrete  circuit  model  and  circuit  analysis  theory,  the  acoustic 
equations  were  simplified.  This  resulted  in  a significant  reduction  in  the  amount  of 
time required for computation. The synthesizer itself is designed in such a way that
the user can readily specify those parameters related to the vocal folds, the wall of
the vocal tract and the shape of the nasal tract, thereby offering a flexibility that
is desirable in research tools.

Synthetic  voices  containing  nasal  sounds  were  produced  by  the  articulatory 
synthesizer  with  a rather  good  nasal  quality.  Data  from  the  experiments  involving 
the  synthesis  of  nasal  sounds  suggest  that  the  velopharyngeal  opening  is  the  main 
control  parameter  corresponding  to  nasality,  and  that  the  inclusion  of  the  maxillary 
sinus  cavities  in  the  nasal  tract  affects  the  quality  of  synthetic  nasal  sounds  to  a 
minimal  degree  only. 

The  voice  conversion  method  was  used  to  investigate  and  to  identify  the  key 
control  parameters  for  the  synthesis  of  female  voices.  The  results  showed  that  three 
control  parameters  (pitch,  vocal  tract  shape,  and  the  vibratory  pattern  of  the  vocal 
folds)  must  be  correctly  provided  in  order  to  synthesize  high-quality  female  voices. 
The  perceptual  evaluation  also  revealed  that  the  degree  of  breathiness  was  strongly 
correlated  to  the  waveform  of  the  glottal  area  function  and  to  the  noise  generated  at 
the glottis.


CHAPTER 1

INTRODUCTION


As  one  of  the  most  active  branches  of  speech  communication,  speech 
synthesis  involves  the  generation  of  speech  by  a machine  according  to  a phonetic 
transcription of a message. A brief overview of both the history and the applications
of speech synthesis is provided in the first two sections of this chapter, which are
followed by a more in-depth discussion of the three most widely used synthesizer
models, including the model chosen for this investigation, namely the articulatory
synthesizer. The chapter concludes by defining both the scope and the goals of this
research effort.


History  of  Speech  Synthesis 

For  years  men  have  been  curious  about  human  speech  production.  This 
curiosity  led  to  an  investigation  into  whether  or  not  speech  could  be  artificially 
simulated  or  synthesized.  The  first  attempt  at  speech  sound  reproduction  was  made 
by  von  Kempelen  [Flanagan,  1972a].  Demonstrated  in  1791  in  Vienna,  his 
mechanical  speaking  machine  imitated  vowel  sounds  and  a number  of  consonant 
sounds  including  nasals.  Typical  of  such  prototypes,  von  Kempelen’s  machine  was 
far  from  perfect,  but  its  underlying  theories  paved  the  way  for  the  future  exploration 
of  speech  synthesis. 

One  of  the  first  electrical  speech  synthesizers  was  demonstrated  in  1936  at  the 
Harvard  Tercentenary  by  Homer  Dudley  [1939].  His  vocoder  (or  voice  coder) 
automatically generated speech using a set of electrical currents which were
instantaneously derived from the articulated sounds. The significance of Dudley’s
device lies in the fact that it showed that efficient coding methods were available not
only  for  the  transmission  of  voice  signals,  but  for  the  storage  of  those  signals  as  well. 

The PATTERN PLAYBACK [Cooper et al., 1951], which appeared in 1950 at
Haskins Laboratories, is the first example of a modern speech synthesizer.
Schematic  evolutions  of  formant  frequencies  (a  crude  spectrogram)  were  drawn  on  a 
glass  plate,  then  scanned  to  produce  speech.  This  device,  known  as  an 
optical-electrical  speech  synthesizer,  produced  the  sound  described  by  the 
spectrogram.  The  extensive  use  of  the  spectrogram  coupled  with  the  PATTERN 
PLAYBACK  device  greatly  promoted  the  study  of  both  speech  production  and  speech 
perception. 

The  first  electrical  analogs  of  the  vocal  tract  were  the  static  simulators  [Dunn, 
1950;  Stevens  et  al.,  1953]  presented  in  the  early  1950s.  Based  on  a quantitative 
understanding  of  vocal-tract  acoustics,  these  simulators  model  the  vocal  tract  as  an 
electrical  transmission-line.  Although  these  simulators  only  produce  sustained 
vowels,  they  are  considered  the  prototypes  of  modern  articulatory  synthesizers. 

The  1960s  witnessed  the  rapid  evolution  of  digital  computers,  digital  signal 
processing  theory,  and  integrated  circuits.  The  resulting  conceptualization  and 
design  of  computer-simulated  speech  synthesizers,  which  led  to  the  development  of 
formant,  linear  predictive  and  articulatory  synthesizers,  were  a natural  outgrowth  of 
such  an  evolution.  To  date,  these  three  synthesizer  models  are  still  the  subject  of 
extensive  investigation. 


Applications  of  Speech  Synthesis 

Speech synthesis provides the basis for a variety of applications.
Communication engineers view speech synthesis as a means of efficient
voice-information transmission and, therefore, as a method by which channel capacity
can be conserved. An early application of speech synthesis is found in speech coding
systems  called  vocoders.  The  vocoder  has  useful  applications  in  a variety  of 
analysis-synthesis  systems  involving  the  transmission,  storage,  and  encryption  of 
speech  signals. 

The  use  of  digital  techniques  and  computer  technology  in  speech  synthesis  has 
opened  vast  possibilities  for  machine  assistance  to  humans,  such  as  voice  response 
systems.  The  primary  mode  of  communication  seemingly  preferred  by  humans  is 
that  of  their  naturally  spoken  language.  The  advantages  of  speech  over  a visual 
display  are  numerous:  (1)  a listener  is  not  tied  to  a terminal;  (2)  speech  draws  one’s 
attention  more  than  the  written  text;  and  (3)  understanding  speech  does  not  require 
much  user  training.  These  reasons  help  to  explain  the  considerable  amount  of 
attention  that  voice  response  systems  have  recently  received.  Around  1970,  speech 
synthesis  for  computer  output  developed  into  a more  important  research  stimulus 
than vocoder design. In most voice response systems, synthetic speech is
produced automatically from an input text. This technique is called
text-to-speech  synthesis  [Klatt,  1987]  and  its  applications  include  talking  aids  for  the 
vocally  handicapped,  reading  aids  for  the  visually  impaired,  training  aids,  automatic 
answering  systems,  and  talking  terminals. 

As  speech  synthesis  techniques  continue  to  develop,  people  will  become 
increasingly  discontent  with  the  metallic  sounding  but  intelligible  speech  from  speech 
synthesizers;  and  it  is  precisely  this  discontentment  which  signals  the  need  for 
improvement  in  the  quality  of  the  synthetic  speech.  The  factors  corresponding  to 
speech  quality  can  be  determined  by  varying  the  selected  parameters  in  a controlled 
fashion  and  by  listening  to  the  resultant  synthetic  speech.  This  approach,  sometimes 
referred  to  as  speech-to-speech  synthesis  [Childers  et  al.,  1989],  is  extremely 
relevant to research involving speech production and speech perception.


Speech Synthesis Models

Each  speech  synthesis  strategy  is  based  on  a model  of  the  vocal  source  and  the 
vocal  tract.  The  most  relied  upon  models  appear  to  fall  into  three  categories  of 
speech  synthesizers:  linear  prediction,  formant,  and  articulatory.  Selection  of  the 
appropriate  model  largely  depends  upon  a “best  fit”  method  in  which  the  models’ 
advantages  and  disadvantages  are  weighed  against  each  other  with  respect  to  the 
nature  of  the  research. 


Linear  Prediction  Speech  Synthesizers 

Atal [1970] coined the term linear prediction for speech analysis. Details of
this new approach to speech analysis and synthesis, linear predictive coding (LPC),
were published by Atal and Hanauer [1971]. The basic premise behind LPC is that
each speech sample can be represented as a linear combination of its past values
and the current value of the input. An LPC
speech  synthesizer  consists  of  an  excitation  source  and  a time-varying  all-pole  filter 
(Figure  1-1).  The  all-pole  filter  contributes  to  the  short-time  spectral  envelope, 
while  the  fine  structure  is  created  by  the  source  of  excitation. 
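
The linear-combination premise can be sketched in a few lines of Python. This is
a minimal illustration only, not the synthesizer of Figure 1-1: the function names,
the fixed coefficient set and the impulse-train excitation are assumptions made for
the example, and a practical synthesizer would update the coefficients frame by frame.

```python
def lpc_synthesize(coeffs, excitation, gain=1.0):
    """All-pole synthesis: s[n] = sum_k a_k * s[n-k] + G * u[n].

    coeffs     -- predictor coefficients a_1..a_p (hypothetical values)
    excitation -- input samples u[n] (pitch pulses or noise)
    gain       -- excitation gain G
    """
    order = len(coeffs)
    output = []
    for n, u in enumerate(excitation):
        sample = gain * u
        # Add the weighted past output samples (the "prediction").
        for k in range(1, order + 1):
            if n - k >= 0:
                sample += coeffs[k - 1] * output[n - k]
        output.append(sample)
    return output


def impulse_train(num_samples, period):
    """Conventional voiced excitation: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]


if __name__ == "__main__":
    # A first-order filter driven by a single pulse decays geometrically.
    print(lpc_synthesize([0.5], [1.0, 0.0, 0.0, 0.0]))
```

The filter coefficients shape the spectral envelope, while the choice of excitation
(pulse period or noise) sets the fine structure, as described above.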

Since  the  LPC  approach  to  speech  analysis  focuses  on  the  representation  of 
the  short-time  spectral  envelope  of  the  speech  signal,  it  says  nothing  about  the  role 
excitation  plays  in  the  proper  synthesis  of  speech.  Conventional  LPC  synthesizers 
are  developed  from  the  traditional  excitation  model  — pitch  pulse  and  white  noise  — 
which  is  widely  used  because  it  is  the  only  way  of  synthesizing  speech  at  low  bit  rates 
in the vicinity of 2 Kbits/sec [Makhoul et al., 1985]. The synthetic speech produced
by  this  model  is  intelligible  but  exhibits  unnatural  characteristics.  To  improve  the 
sound  quality,  the  conventional  impulse  train  excitation  can  be  replaced  by  a 
waveform  that  approximates  the  glottal  volume  velocity  [Childers  et  al.,  1985;  1987; 
1989].

Figure 1-1. Block diagram of LPC synthesizer.

It is a difficult task to reliably classify the short segments of a speech
waveform  into  voiced  and  unvoiced  categories  [Hess,  1982],  but  the  quality  of 
synthetic  speech  produced  by  a conventional  LPC  speech  synthesizer  depends  on  an 
accurate  separation  of  speech  into  these  two  categories.  The  use  of  multiple  pulses 
[Atal  and  Remde,  1982]  or  stochastic  codes  [Schroeder  and  Atal,  1985]  to  represent 
the  excitation  source  avoids  this  difficulty  and  improves  the  quality  of  synthetic 
speech.  The  multi-pulse  LPC  synthesizer  requires  a few  pulses  (8  pulses  every  10 
msec)  to  generate  different  kinds  of  speech  sounds  — including  voiced  and 
unvoiced  — with  little  audible  distortion.  Instead  of  minimizing  a mathematical 
root-mean-square  error,  both  methods  minimize  the  subjective  loudness  of 
quantization  noise  as  perceived  by  the  human  ear  in  the  presence  of  the  speech 
signal. 

The advantages which make the LPC speech synthesizer quite popular are (1)
few parameters are required to control the synthesizer; (2) the analysis process is
entirely automatic, and fast algorithms are available for calculating the LPC
coefficients; (3) the synthesis process is relatively straightforward; and (4) the
synthetic speech itself is intelligible.

The  disadvantages  of  the  LPC  synthesizer  are  (1)  it  cannot  properly  produce 
nasals,  fricatives  and  stop  consonants;  (2)  the  speech  generated  by  LPC  synthesizers 
often  sounds  “buzzy”;  and  (3)  the  parameters  have  little  or  no  relation  to  the 
anatomy  or  physiology  of  speech  production. 


Formant  Speech  Synthesizers 

Formant synthesizers are based on the source-tract speech production model
[Fant, 1960]. They consist of an excitation source and a number of resonators
(Figure 1-2). The resonators, whose resonance frequencies and bandwidths can be
varied, model the frequency-transmission characteristics of the vocal tract between
the glottis and the mouth. The source, which consists of either a pitch impulse
generator or a noise generator, provides the excitation for the resonators.

Figure 1-2. Block diagram of formant synthesizer.

The vocal-tract transfer function can be reduced to either a series of complex
pole-pair networks or a parallel addition of complex pole-pair networks and, thus,
defines  the  two  basic  kinds  of  formant  synthesizers:  the  cascade  synthesizer  and  the 
parallel  synthesizer.  The  best  formant  synthesizers  developed  to  date  are  a 
combination  cascade/parallel  synthesizer  by  Klatt  [1980]  and  a parallel  synthesizer 
by  Rye  and  Holmes  [1982]. 
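
A single complex pole-pair network can be realized as a second-order digital
resonator. The sketch below uses the difference equation commonly associated with
the Klatt [1980] synthesizer and chains several resonators in cascade. The sampling
rate, formant frequencies and bandwidths in the usage lines are illustrative
assumptions, not values taken from this study.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, fs_hz):
    """Coefficients of y[n] = a*x[n] + b*y[n-1] + c*y[n-2] (unity gain at DC)."""
    period = 1.0 / fs_hz
    c = -math.exp(-2.0 * math.pi * bw_hz * period)
    b = 2.0 * math.exp(-math.pi * bw_hz * period) * math.cos(
        2.0 * math.pi * freq_hz * period)
    a = 1.0 - b - c
    return a, b, c


def resonate(samples, coeffs):
    """Filter a list of samples through one resonator."""
    a, b, c = coeffs
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for x in samples:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def cascade(samples, formants, fs_hz):
    """Cascade configuration: the output of each resonator feeds the next."""
    for freq_hz, bw_hz in formants:
        samples = resonate(samples, resonator_coeffs(freq_hz, bw_hz, fs_hz))
    return samples


if __name__ == "__main__":
    fs = 10000.0                                        # assumed sampling rate
    vowel = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)]  # assumed formants
    pulse = [1.0] + [0.0] * 99
    print(cascade(pulse, vowel, fs)[:5])
```

A parallel configuration would instead sum the outputs of resonators driven by the
same source, each with its own amplitude control.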

Since  the  control  parameters  of  a formant  synthesizer  are  closely  related  to 
the  spectral  properties  of  speech  sounds,  the  formant  synthesizer  is  an  essential  tool 
for  studying  speech  perception  through  the  synthesis  of  speech.  It  is  for  this  same 
reason  that  it  is  often  used  in  synthesis-by-rule  systems  [Klatt,  1987]. 

When  formant  synthesizers  are  used  in  analysis-synthesis  systems,  the 
formant  information  is  extracted  from  the  original  speech  signal.  The  main  problem, 
however,  is  the  accuracy  of  the  analysis  methods  used  for  acquiring  the  formant 
information from the original speech signal. Since missed peaks (for example, when
two formants merge into one peak) and spurious peaks in the autoregressive (AR)
spectrum  can  throw  conventional  formant  trackers  completely  off  [McCandless, 
1974],  formant  tracking  is  one  of  the  more  difficult  problems  of  speech  analysis  and 
the  total  automation  of  formant  tracking  still  remains  unresolved.  Formant  tracking 
of  the  female  voice  is  more  difficult  than  that  of  the  male  voice  because  the 
fundamental  frequency  of  the  female  voice  is  higher  than  that  of  the  male  voice.  Two 
facts  explain  why  formant  tracking  of  high-pitched  voices  is  more  difficult:  (1) 
higher-pitched  voices  have  relatively  widely  spaced  harmonics  and  thus  provide 
fewer  points  from  which  to  estimate  the  spectrum  envelope;  and  (2)  the  LPC 
envelope peaks tend to draw away from their true values toward the nearest harmonic
peaks [Makhoul, 1975]. Measuring the formant bandwidths is even more difficult
than tracking the formant frequencies [Pinson, 1963].


Articulatory  Speech  Synthesizers 

Both  LPC  and  formant  synthesizers  are  based  on  the  source-tract  model 
which  assumes  a linear  separability  between  the  excitation  and  the  vocal  tract.  While 
many  acoustical  properties  of  the  vocal  system  such  as  source-tract  interaction 
cannot  be  simulated  by  these  simplified  models,  the  articulatory  speech  synthesizers 
directly  simulate  the  generation  and  propagation  of  sound  waves  inside  the  vocal 
system. 

Articulatory  synthesizers  consist  of  two  separate  components:  an  articulatory 
model  and  an  acoustic  model  (Figure  1-3).  Articulatory  features  (movements  of 
articulators  or  configurations  of  the  vocal  tract)  and  features  related  to  phonation  are 
used  as  the  control  parameters.  Neglecting  the  difficulties  in  the  acquisition  of 
control  parameters  and  the  amount  of  computation  involved,  articulatory  speech 
synthesizers  have  the  following  advantages. 

(1)  The  control  parameters  are  directly  related  to  the  articulatory  mechanism 
which  makes  articulatory  synthesis  a valuable  tool  for  speech  production  and 
perception  studies  [Rubin  et  al.,  1981]. 

(2)  The  source-tract  interaction  can  be  properly  modeled  because  articulatory 
synthesizers  simulate  the  vocal  folds  and  the  vocal  tract  as  one  system.  It  was  found 
that  the  source-tract  interaction  is  essential  for  the  synthesis  of  high-quality, 
natural-sounding  speech  [Rothenberg,  1981;  Koizumi  et  al.,  1985]. 

(3) The articulatory synthesizers can produce natural-sounding nasal
consonants and nasalized vowels [Maeda, 1982b].

(4)  It  is  easier  to  interpolate  the  articulatory  parameters  than  the  parameters 
of LPC and formant synthesizers [Sondhi and Schroeter, 1987]. This is because
interpolated values for the control signals of an articulatory synthesizer are
physically  realizable.  For  this  reason,  slightly  erroneous  control  signals  usually  do 
not  result  in  unnatural  speech. 

(5)  Articulatory  speech  synthesis  has  the  potential  for  natural  speech  output  at 
bit  rates  below  4800  bits/sec  — provided  that  proper  articulatory  parameters  are 
available  with  which  to  control  the  synthesizer  [Schroeter  et  al.,  1987]. 


Research  Goals 

Improving  the  quality  of  synthetic  speech  is  of  fundamental  concern  to 
investigators  of  speech  synthesis.  Although  intelligible  speech  can  be  produced  by 
either  LPC  or  formant  synthesizers,  the  speech  produced  does  not  sound  natural. 
There  are  two  main  factors  responsible  for  this  lack  of  naturalness.  First,  the  LPC 
and formant synthesis models do not simulate the voice production mechanism
faithfully, in that the source-tract interaction is usually neglected. Second, if the
control  parameters  are  derived  from  rules,  difficulties  arise  in  the  accurate  modeling 
of  the  dynamics  of  the  human  vocal  system  when  it  changes  from  one  phoneme  to  the 
next.  The  reproduction  of  these  transitions  is  closely  related  to  naturalness  [Coker, 
1967].  In  contrast,  articulatory  synthesizers  directly  simulate  the  voice  production 
mechanism  which  means  that  they  have  the  potential  of  producing  the  most 
natural-sounding  speech  at  low  bit  rates.  Unfortunately,  advances  in  developing 
high-quality  synthesis  using  articulatory  synthesis  models  have  been  slow  as  a direct 
result  of  the  lack  of  reliable  physiological  data,  especially  with  respect  to  females  and 
children,  and  of  the  difficulty  involved  in  modeling  all  relevant  factors  in  the  acoustic 
production  process  [Fant,  1980].  The  purpose  of  this  research  effort  was  to  develop 
a flexible  articulatory  speech  synthesizer  as  an  effective  research  tool  in  order  to 
investigate the factors related to the quality of synthetic speech.

The first goal of this research involves the development of a flexible, stable,
and  computationally  efficient  articulatory  speech  synthesizer  as  a research  tool.  In 
order  to  determine  how  the  parameters  of  the  acoustic  model  affect  the  quality  of 
synthetic  speech,  a flexible  articulatory  speech  synthesizer  whose  parameters  can  be 
readily  changed  is  needed.  To  obtain  synthetic  speech  from  the  articulatory  speech 
synthesizer,  a set  of  ordinary  differential  equations  needs  to  be  solved  by  using 
numerical  methods.  Unfortunately,  the  solution  of  these  equations  contains 
components  subject  to  both  rapid  and  slow  changes;  and  thus,  the  problem  of 
stability arises. The lack of a formalized analysis procedure means a trial-and-error
method must be adopted to produce synthetic speech with high quality. A fast
algorithm  is  required  to  do  these  computations  efficiently. 
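
The stability problem can be illustrated on a scalar test equation, y' = lam * y
with a large negative lam, which stands in for the rapidly changing components
mentioned above. The sketch compares the forward Euler method with the trapezoidal
method named in the abstract. The test equation, step size and function names are
assumptions chosen for illustration; the actual acoustic equations are a coupled
system and are not reproduced here.

```python
def forward_euler(lam, y0, step, num_steps):
    """Explicit update: y[n+1] = y[n] + h * lam * y[n]."""
    y = y0
    for _ in range(num_steps):
        y = y + step * lam * y
    return y


def trapezoidal(lam, y0, step, num_steps):
    """Implicit update: y[n+1] = y[n] + (h/2) * (lam*y[n] + lam*y[n+1]).

    For this linear test equation the implicit relation can be solved in
    closed form, giving a constant growth factor per step.
    """
    growth = (1.0 + 0.5 * step * lam) / (1.0 - 0.5 * step * lam)
    y = y0
    for _ in range(num_steps):
        y = growth * y
    return y


if __name__ == "__main__":
    lam, h, n = -1000.0, 0.01, 50   # fast decay, deliberately coarse step
    print("forward Euler:", forward_euler(lam, 1.0, h, n))
    print("trapezoidal:  ", trapezoidal(lam, 1.0, h, n))
```

With this step size the explicit update multiplies the solution by a factor larger
than one in magnitude at every step, whereas the trapezoidal factor stays below one
in magnitude for any negative lam, so the computed solution decays as the true one does.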

The second goal of this research involves the identification of the primary factors
relating to the perception of nasality. The nasal consonants have a high frequency of
occurrence  in  many  languages  (in  fact,  about  11  percent  for  English)  [Mori  et  al., 
1979].  For  the  production  of  nasal  consonants,  the  velum  is  lowered  which  leaves  the 
entrance  to  the  nasal  cavities  open.  The  inclusion  of  the  nasal  cavities  in  the 
resonance  system  introduces  new  acoustic  features  into  the  nasal  sounds.  For 
example,  the  spectra  of  nasals  contain  both  poles  and  zeros.  Since  LPC  speech 
synthesizers  are  based  on  the  all-pole  model,  the  resultant  synthetic  nasals  are  not  of 
high  quality.  Using  a parallel-configured  formant  synthesizer,  zeros  can  be 
introduced  between  formant  peaks;  however,  the  frequency  and  bandwidth  of  these 
zeros  are  not  under  the  control  of  the  user.  Therefore,  only  the  articulatory  speech 
synthesizer  can  properly  produce  nasals  and  nasalized  vowels  by  correctly  modeling 
the  nasal  tract  and  simulating  the  movement  of  the  velum. 

Since  almost  all  existing  speech  synthesizers  are  scaled  after  a male  prototype 
[Fant,  1980],  the  third  research  goal  is  to  identify  the  important  factors  affecting  the 
synthesis of female voices. It is a well-established fact that the vocal folds and the
vocal tracts of the male and the female are different [Hirano et al., 1983; Fant, 1973]
and  as  such  present  a different  set  of  problems  in  terms  of  speech  synthesis.  The 
effects  of  these  differences  are  investigated  in  this  research. 


CHAPTER  2 

ARTICULATORY  SPEECH  SYNTHESIZER 


To  implement  an  articulatory  speech  synthesizer  on  a digital  computer,  a 
mathematical  model  of  the  vocal  system  is  required.  A block  diagram  of  the 
proposed  articulatory  speech  synthesizer  is  shown  in  Figure  2-1.  It  contains  an 
articulatory  model  and  an  acoustic  model.  The  articulatory  model  transfers  the 
positions  of  key  articulators,  such  as  the  jaw,  tongue,  lips,  and  velum,  to  the 
cross-sectional  area  function  of  the  vocal  tract.  The  acoustic  model  is  a set  of 
ordinary  differential  equations  (acoustic  equations)  that  describe  the  acoustic 
properties  of  the  vocal  system.  To  obtain  synthetic  speech  one  must  solve  the 
acoustic  equations  by  using  numerical  methods.  In  this  chapter  the  approaches  used 
in  articulatory  speech  synthesizers  are  briefly  reviewed  and  then  followed  by 
discussions  of  all  parts  of  the  proposed  synthesizer. 


Figure 2-1. Block diagram of the proposed articulatory speech synthesizer.


Approaches Used in Articulatory Synthesizers

Generally speaking, there are three main approaches used in articulatory
speech synthesizers. The first approach is based on the assumption that the vocal
tract can be represented as a concatenation of tubes with frequency-independent
losses; thus, the pressures and flows along the tubes can be put in terms of forward
and backward traveling waves. Kelly and Lochbaum [1962] presented their K-L
model to simulate wave propagation in the vocal and nasal tracts by using this
approach. Maeda [1977] and Strube [1982] modified the K-L model to include the
effects of dynamic area variation. Rubin et al. [1981] modified the K-L model to
represent the non-ideal termination at the glottis, lips and nostrils. They calculated
the  reflection  coefficients  and  the  transfer  function  in  the  Z-domain.  Based  on  the 
transfer  function,  digital  filters  were  designed  to  perform  the  synthesis.  In  general, 
this approach is not only much faster than the other two approaches, but it can also
take advantage of parallel processing to realize real-time synthesis. This
method,  however,  does  not  accurately  simulate  the  real  vocal  system  in  several 
aspects:  (1)  the  frequency-dependent  propagation  losses  and  frequency-dependent 
radiation  cannot  be  modeled  properly  because  the  characteristic  impedances  of  each 
tube  and  load  must  be  resistive;  (2)  the  source-tract  interaction  cannot  be  included, 
since  the  kinetic  resistance  of  the  glottis  and  the  volume  velocity  through  the  glottis 
are  related  to  each  other;  (3)  the  length  of  the  vocal  tract  cannot  be  varied  at  will 
because  the  fixed  length  of  each  tube  segment  is  related  to  the  sample  interval;  and 
(4)  the  vocal  tract  is  simulated  as  hard-walled. 
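
The traveling-wave idea behind the first approach can be sketched for a single
junction between two lossless tube segments. The sketch assumes pressure waves, a
reflection coefficient defined from the two cross-sectional areas, and one sample
of delay per one-way traversal of a segment; other sign and delay conventions are
in use, and the function names and numerical values are hypothetical.

```python
def reflection_coefficient(area_left, area_right):
    """Pressure reflection coefficient seen by a wave going left to right."""
    return (area_left - area_right) / (area_left + area_right)


def scatter(forward_in, backward_in, r):
    """Scattering at one junction.

    forward_in  -- forward wave arriving from the left segment
    backward_in -- backward wave arriving from the right segment
    Returns (forward wave entering the right segment,
             backward wave entering the left segment).
    """
    forward_out = (1.0 + r) * forward_in - r * backward_in
    backward_out = r * forward_in + (1.0 - r) * backward_in
    return forward_out, backward_out


def segment_length_cm(fs_hz, sound_speed_cm_s=35000.0):
    """Segment length implied by the sampling rate (one-way delay of one sample)."""
    return sound_speed_cm_s / fs_hz


if __name__ == "__main__":
    r = reflection_coefficient(3.0, 1.0)   # areas in cm^2, assumed
    print(r, scatter(1.0, 0.25, r), segment_length_cm(20000.0))
```

Because the segment length follows from the sampling rate, the total tract length
can only change in whole segments, which is limitation (3) above.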

The  second  approach  models  the  acoustic  properties  of  the  glottis  and  the 
vocal  tract  by  a set  of  ordinary  differential  equations  solved  by  using  numerical 
methods  for  each  sampling  interval.  Flanagan  and  his  associates  have  published  a 
series  of  papers  [Flanagan  and  Landgraf,  1968;  Ishizaka  and  Flanagan,  1972; 
Flanagan  et  al.,  1975;  1980]  dealing  with  this  kind  of  articulatory  speech  synthesizer. 
Their  most  complete  model  contained  a two-mass  model  of  the  vocal  folds  and  a 
transmission  line  of  several  cylindrical  sections  representing  the  vocal  and  nasal 
tracts.  Each  section  of  the  transmission  line  included  the  elements  accounting  for  the 
inertance,  compliance,  viscous  loss,  heat-conduction  loss,  wall  impedance  and  wall 
radiation.  Additionally,  each  section  included  a latent  random  pressure  source  and 
an  inherent  constriction  loss.  The  model  combined  the  vocal  folds,  and  the  vocal  and 
nasal  tracts  in  a way  that  permitted  normal  acoustic  interaction  and  loading,  which  is 
believed  to  be  important  for  synthesizing  high-quality,  natural-sounding  speech 
[Rothenberg, 1981; Koizumi et al., 1985]. Since this approach simulates the vocal
system in the time domain, the dynamic properties are preserved. Moreover, not
only  the  synthetic  speech,  itself,  but  also  the  distributions  of  pressures  and  volume 
velocities  along  the  vocal  tract  can  be  obtained.  These  distributions  are  helpful  for 
understanding  the  speech  production  process.  However,  this  method  is 
computationally  inefficient  and  the  frequency-dependent  losses  and  radiation  loads 
cannot be accurately simulated. To reduce the computational burden, Maeda
[1982a] simplified Flanagan’s model in three respects: (1) the glottal area was used
as the control parameter, thereby omitting a self-oscillating vocal fold model; (2)
the noise sources within the vocal tract were also omitted; and (3) the losses due to
viscosity and heat conduction were neglected. Based on his simplified model, Maeda
took  advantage  of  a sparse  coefficient  matrix  of  the  acoustic  equations  enabling  a 
fast  algorithm.  Bocchieri  [1983]  modified  Flanagan’s  model  by  reducing  the  number 
of noise sources. He used two noise excitation generators: one located at the glottis
for  simulating  aspirated  sounds  and  the  other  placed  at  the  point  of  maximum 
constriction  for  producing  other  unvoiced  sounds. 
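
The lossless part of one transmission-line section can be sketched from the
standard acoustic analogies, in which a short tube of area A and length l has
inertance rho*l/A and compliance A*l/(rho*c^2). The sketch covers only these two
elements; the loss, wall and source terms listed above are omitted, and the
constants and function names are assumptions for illustration rather than values
from the proposed synthesizer.

```python
import math

RHO = 1.14e-3        # air density in g/cm^3 (assumed value)
SOUND_SPEED = 3.5e4  # speed of sound in cm/s (assumed value)


def section_elements(area_cm2, length_cm):
    """Return (inertance, compliance) of one cylindrical section."""
    inertance = RHO * length_cm / area_cm2
    compliance = area_cm2 * length_cm / (RHO * SOUND_SPEED ** 2)
    return inertance, compliance


def section_delay_s(area_cm2, length_cm):
    """Propagation time through the section, sqrt(L*C) = length / c."""
    inertance, compliance = section_elements(area_cm2, length_cm)
    return math.sqrt(inertance * compliance)


def quarter_wave_resonance_hz(tract_length_cm):
    """First resonance of a uniform tube closed at one end and open at the other."""
    return SOUND_SPEED / (4.0 * tract_length_cm)


if __name__ == "__main__":
    print(section_elements(3.0, 1.0))
    print(section_delay_s(3.0, 1.0), quarter_wave_resonance_hz(17.5))
```

The area enters the two elements in opposite ways, so a constriction raises the
inertance and lowers the compliance of its section while leaving the delay unchanged.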

The  third  approach  is  a hybrid  time-frequency  domain  method  [Sondhi  and 
Schroeter,  1987;  Allen  and  Strong,  1985].  This  method  is  called  the  hybrid  method 
because  the  glottis  is  modeled  in  the  time  domain,  while  the  vocal  tract  is  modeled  in 
the  frequency  domain.  Since  the  frequency-dependent  losses  and  the  radiation  of 
the  vocal  tract  can  be  represented  more  accurately  in  the  frequency  domain  and  since 
the  vocal  tract  is  considered  a linear  system,  the  vocal  tract  is  more  efficiently 
modeled  in  the  frequency  domain.  In  contrast,  the  highly  nonlinear  nature  of  the 
glottis  lends  itself  to  a better  model  in  the  time  domain.  The  source  and  tract  models 
are  then  interfaced  by  the  inverse  Fourier  transformation  and  digital  convolution. 
This  approach  takes  advantage  of  both  frequency  and  time  domain  techniques  in 
order  to  obtain  a fast  and  versatile  realization,  but  the  inverse  Fourier  transformation 
only gives a steady-state solution; thus, the dynamic behavior is missing. For
example, this method cannot synthesize high-quality sound with fast transitions like
the ones that occur in stop sounds. The output of this synthesizer is a speech
waveform only. No information about the pressures and the volume velocities within
the vocal tract is calculated.

To  study  how  the  parameters  of  an  articulatory  speech  synthesizer  affect  the 
quality  of  synthetic  speech,  one  must  use  a synthesizer  which  simulates  the  main 
aspects  of  the  acoustic  properties  of  the  vocal  system.  Such  a synthesizer  should 
include  the  source-tract  interaction  which,  as  mentioned  previously,  is  important  for 
natural-sounding  synthetic  speech.  This  means  that  the  first  approach  is  effectively 
eliminated as a viable method of investigation for this research effort; nor is the third
approach adequate, because the synthesizer has to produce all sounds, not just some.
Since  the  second  approach  models  the  glottal  source  and  the  vocal  tract  by  a 
corresponding  set  of  differential  equations,  it  simulates  the  dynamic  property  of  the 
vocal  system  more  accurately  and  gives  more  acoustic  information.  This  approach 
was  chosen  with  consideration  given  to  speeding  up  the  computations. 


Articulatory  Model 

For articulatory speech synthesis the vocal tract can be simplified as a
cylindrical pipe of non-uniform cross-section whose physical dimensions are
completely described by its cross-sectional area, $A(x)$, as a function of the distance
$x$ along the tube. This simplification does not introduce significant error for two
reasons. (1) Most of the energy of speech sounds is contained in the frequency range
from 80 to 8000 Hz [Dunn and White, 1940], and the speech quality is not significantly
affected when only frequencies below 5000 Hz are retained [Klatt, 1980]; in this
frequency range the cross-sectional dimensions of the vocal tract are sufficiently
small compared to the sound wavelength, so the departure from the plane wave is not
significant. (2) Sondhi [1986] pointed out that for typical dimensions of the
vocal tract the difference between a non-bent and a bent tube in resonance frequencies
below 4 kHz is in the range of 2% to 8%.

The  articulatory  model  describes  the  vocal  tract  shape  in  terms  of  articulatory 
variables  which  specify  the  position  of  the  jaw,  hyoid,  tongue  body,  tongue  blade, 
lips,  velum,  etc.  in  the  midsagittal  plane.  The  articulatory  model,  though  simple, 
captures  the  essential  ingredients  of  human  articulation.  The  positions  of  these  key 
articulators  determine  the  outline  of  the  vocal  tract  in  the  midsagittal  plane.  From 
this  outline  the  width  function  and,  subsequently,  the  cross-sectional  area  function  of 
the  vocal  tract  are  determined. 

Coker  and  Fujimura  [1966]  introduced  an  articulatory  model  with  parameters 
assigned  to  the  tongue  body,  tongue  tip,  and  velum.  Coker  [1976]  later  modified  this 
model  by  using  time  constants  with  which  the  individual  articulators  responded  to 
particular  commands.  In  the  interim  another  articulatory  model  was  designed  by 
Mermelstein [1973]. His well-documented model can be adjusted to more accurately
mimic  the  midsagittal  X-ray  tracings.  Deviations  from  the  real  midsagittal  X-ray 
tracing  are  not  considered  acoustically  significant  when  compared  to  the  error 
involved  in  estimating  the  area  from  the  sagittal  distances. 

Mermelstein's model provided the basis for the articulatory model used in this
research (Figure 2-2), in which the five primary articulators (tongue body, tongue tip,
jaw, lips, and velum) are movable. While the movements of the jaw, tongue, and
velum  are  independent  of  each  other,  the  positions  of  the  lips  depend  upon  the 
positions  of  the  other  articulators.  Movements  of  the  jaw  and  the  velum  have  one 
degree  of  freedom,  while  all  other  articulators  move  with  two  degrees  of  freedom. 
The  movement  of  the  velum  has  two  effects:  (1)  it  modulates  the  size  of  the  coupling 
port  to  the  fixed  nasal  tract;  and  (2)  it  alters  the  shape  of  the  oral  branch  of  the  vocal 
tract.  The  specification  of  these  five  key  articulator  positions  completely  determines 
the outline of the vocal tract in the midsagittal plane.

Figure 2-2. Articulatory model of the vocal cavities.

Once the articulatory positions have been specified, the cross-sectional areas
are  calculated  by  superimposing  a grid  structure  on  the  vocal  tract  outline  (Figure 
2-3).  The  points  of  intersection  between  the  outline  and  the  grid  line  are  calculated 
first,  then  the  center  line  of  the  vocal  tract  is  formed  by  connecting  the  center  points 
of  the  adjacent  grid  lines.  The  length  of  the  center  line  is  considered  equivalent  to  the 
length of the vocal tract. The sagittal distances $g_i$ are then converted into
cross-sectional areas by the following formulas [Mermelstein, 1973]. The
cross-sectional area in the pharyngeal region is approximated as an ellipse with $g_i$ as
one axis and the other increasing from 1.5 to 3 cm as one moves upward from the
larynx tube to the velopharynx. In the soft-palate region the area is taken as $2 g_i^{1.5}$, in
the hard-palate region as $1.6 g_i^{1.5}$, and between the alveolar ridge and incisors as $1.5 g_i$
for $g_i < 0.5$, $0.75 + 3(g_i - 0.5)$ for $0.5 < g_i < 2$, and $5.25 + 5(g_i - 2)$ for $g_i > 2$. In the labial
region the area is assumed elliptical with width in centimeters given by $2 + 1.5(s_l - p_l)$,
where $p_l$ is the lip protrusion and $s_l$ the vertical lip separation.
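The power-law and piecewise-linear conversions quoted above from Mermelstein [1973] can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, sagittal distances are assumed to be in centimeters and areas in square centimeters, and the pharyngeal and labial ellipses are left out because their second axes depend on position along the tract. The text does not say which piece applies exactly at 0.5 cm and 2 cm; the pieces agree at those points, so the choice made here does not change the result.

```python
def soft_palate_area(g):
    """Area (cm^2) in the soft-palate region for sagittal distance g (cm)."""
    return 2.0 * g ** 1.5


def hard_palate_area(g):
    """Area (cm^2) in the hard-palate region for sagittal distance g (cm)."""
    return 1.6 * g ** 1.5


def alveolar_area(g):
    """Area (cm^2) between the alveolar ridge and the incisors.

    Piecewise-linear rule quoted in the text; boundary handling at
    g = 0.5 and g = 2 is an assumption (the pieces are continuous there).
    """
    if g < 0.5:
        return 1.5 * g
    if g < 2.0:
        return 0.75 + 3.0 * (g - 0.5)
    return 5.25 + 5.0 * (g - 2.0)
```

Because the three alveolar pieces join continuously, a slowly moving articulator produces a smoothly varying area in this region.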


Acoustic  Models  of  the  Vocal  System 
This  section  reviews  the  various  models  of  the  glottal  source,  vocal  tract,  and 
radiation  along  with  their  respective  advantages  and  disadvantages.  Based  on  these 
reviews,  the  acoustic  models  for  this  research  are  established. 

The  Glottal  Source  Model 

Generally,  there  are  two  glottal-source  models:  (1)  a model  with  acoustic 
parameters  and  (2)  a self-oscillating  model  of  the  vocal  folds.  The  glottal  source 
model  with  acoustic  parameters  expresses  the  shape  of  the  glottal  waveform  in  terms 
of closed forms. Fant [1979] proposed a dynamic three-parameter glottal pulse
model. In order to include a final return phase after the discontinuity point at
closure, Ananthapadmanabha [1982] designed a five-parameter model which was
based directly on the unintegrated inverse filter output. To optimize the number of
parameters needed for a reasonable approximation to the actual flow conditions,
Fant et al. [1985] proposed a four-parameter model. When these glottal source
models are used in speech synthesizers, the calculations involved are simple, but the
source-tract interaction cannot be simulated.

Figure 2-3. Grid system for the conversion of midsagittal dimensions to
cross-sectional area values.

In  contrast,  the  self-oscillating  model  of  the  vocal  folds  simulates  the  glottal 
behavior  in  terms  of  a mechanical  system  and  an  aerodynamic  system;  and  thus,  the 
source-tract  interaction  is  included  in  this  model.  Making  an  adequate,  yet 
mathematically  feasible,  model  of  the  vocal  folds  is  a difficult  task.  Models  with 
several  elements  are  capable  of  handling  the  gross  features  of  spatially  varying  tissue 
parameters,  nonlinear  effects  and  irregular  geometrical  configurations,  but  they 
oversimplify  the  basic  stress-strain  mechanism  and  the  boundary  conditions. 
Although  continuum  models  can  adequately  describe  the  distributive  effects,  the 
boundary  conditions,  and  the  large  variety  of  vibrational  patterns,  they  are  extremely 
complex. 

Several  models  have  been  proposed  to  simulate  vocal  fold  vibrations.  The 
one-mass  model  [Flanagan  and  Landgraf,  1968]  has  a single  horizontal  degree  of 
freedom.  A schematic  diagram  of  the  one-mass  glottal  system  is  shown  in  Figure 
2-4. When the one-mass model is used in the articulatory synthesizer, the
synthesizer is capable of producing an acceptable voiced sound as well as simulating
several properties of glottal flow such as glottal area and volume velocity. However,
this particular model produces too much fold-tract interaction, cannot sustain
oscillation for the capacitive input load of the vocal tract, and loses the phase
difference in the motion of the fold edges [Ishizaka and Flanagan, 1972].

As  developed  by  Ishizaka  and  Flanagan  [1972]  the  two-mass  model  (Figure 
2-5)  eliminates  these  drawbacks.  In  this  model,  each  vocal  fold  is  divided  into  upper 
and lower parts with each part consisting of a nonlinear mechanical oscillator. The
two masses of the fold are coupled by a linear spring. This model mimics the actual
physiological behavior of the vocal folds [Guerin, 1985; Cranen and Boves, 1985a;
1985b]. For example, the phase differences between the upper and lower fold-edges
correspond to the motions observed in high-speed photography.

Figure 2-4. A schematic diagram of the one-mass model.

Figure 2-5. A schematic diagram of the two-mass model.

Ishizaka  and  Flanagan  [1977]  examined  the  acoustic  significance  of 
longitudinal  displacement  in  the  self-oscillatory  behavior  of  the  vocal  folds.  They 
found  that  the  contribution  of  longitudinal  displacement  is  not  perceptually 
significant  nor  is  it  essential  for  modeling  the  dominant  acoustic  properties  of 
articulation.  In  the  two-mass  model  the  dimensions  of  the  masses  are  considered 
fixed  and  as  such  do  not  adequately  account  for  the  mucosal  surface  wave.  Koizumi 
et  al.  [1987]  proposed  three  new  two-mass  models  devised  to  better  account  for  the 
mucosal surface wave as well as for the relevant glottal detail, including vertical
phasing. These recently developed models tend to produce a soft, natural-sounding
synthetic voice.

Although  Titze  [1973;  1974]  was  successful  in  simulating  the  motions  of 
mucosa  by  including  the  vertical  degrees  of  freedom  in  his  16-mass  model,  the 
mechanical  system  involved  is  much  more  complex.  In  general,  all  these 
self-oscillating  models  are  too  computationally  inefficient  for  the  production  of 
synthetic  speech. 

In  an  effort  to  reduce  the  burden  of  computation  while  maintaining  the 
acoustic  interaction  between  the  source  and  the  tract,  two  simplifications  were  made 
in  the  glottal  source  model  for  this  research.  First,  the  model  merely  simulates  the 
acoustic  properties  of  the  vocal  folds,  while  the  movement  of  the  vocal  folds  is 
specified  by  the  user  in  terms  of  the  glottal  area  function.  This  simplification  is 
based  on  the  fact  that  the  waveform  of  the  glottal  area  is  almost  independent  of  the 
vocal  tract  shape,  while  the  shape  can  substantially  influence  the  waveform  of  glottal 
flow [Ishizaka and Flanagan, 1972]. This is because the mechanical impedance of
the fold system is much higher than the impedance of the acoustic system. In addition,
vertical phasing in the vocal fold vibration may also contribute to making the
projected glottal area relatively insensitive to tract loading [Allen and Strong, 1985].
Second, the vertical phasing in vocal fold vibration is not simulated. This
simplification is made possible by the fact that the improved intraglottal pressure
distribution for the two-mass model is also applicable to a one-mass formulation
which, in turn, suggests that a one-mass model with this modification will result in
physiologically plausible source-tract interactions [Ishizaka and Flanagan, 1972].

A schematic  diagram  of  the  proposed  glottal  source  model  which  contains  the 
glottis as well as its inlet and outlet is shown in Figure 2-6. Notice that the glottis
itself is represented by a rectangular slit, where $A_g$, $l_g$, and $d$ denote its area,
length, and depth, respectively.

When  air  passes  through  the  abrupt  contraction  in  the  cross-sectional  area  at 
the  inlet  to  the  glottis,  eddies  are  produced.  The  loss  factor  for  such  a contraction  has 
been  studied  in  fluid  flow  experiments  [Kaufmann,  1963]  and  was  found  to  be  on  the 
order  of  0.4  to  0.5.  Flow  measurements  on  the  plaster  cast  models  of  the  larynx  set 
the  loss  figure  at  0.37  [van  den  Berg  et  al.,  1957].  This  latter  figure  is  used  to 
estimate the pressure drop at the inlet, which is described by

$$P_B (1.00 + 0.37)$$

where $P_B = \rho u_g^2 / 2$ is the Bernoulli pressure, $\rho$ is the air density, and $u_g$ is the particle
velocity at the lower fold-edge [Ishizaka and Flanagan, 1972].

Within  the  constriction  formed  by  the  folds,  the  pressure  drop  is  assumed  to 
be  governed  by  the  viscous  loss.  In  this  region  the  pressure  falls  linearly  with 
distance  according  to  a resistance  to  the  volume  flow.  The  amount  of  viscous 
resistance  is  equal  to 
$$\frac{12 \mu d \, l_g^2}{A_g^3}$$

where $\mu$ is the shear viscosity coefficient [Flanagan, 1972b].

At the abrupt expansion of the glottal outlet, the pressure approaches the
atmospheric value. Based on Newton's law of the conservation of momentum, which
holds in the theory of fluid flow, the pressure recovery equals

$$P_B \, 2N(1 - N)$$

where $N = A_g / A_1$, and $A_1$ is the input area to the vocal tract [Ishizaka and Flanagan,
1972].

Figure 2-6. Schematic diagram of the glottal source model.

The  equivalent  circuit  of  the  glottal  orifice  is  shown  in  Figure  2-7.  The 
elements  of  this  acoustic  circuit  are  given  by 


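$$R_v = \frac{12 \mu l_g^2 d}{A_g^3}$$

$$R_k = \frac{\rho}{2} \left[ \frac{1.37}{A_g^2} - \frac{2}{A_g A_1} \left( 1 - \frac{A_g}{A_1} \right) \right] |U_g|$$

$$L_g = \frac{\rho d}{A_g}$$

where $U_g$ is the glottal volume velocity and $A_1$ is the input area to the vocal tract.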
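As a rough numerical illustration of the glottal orifice elements, the sketch below evaluates a viscous resistance, a flow-dependent kinetic resistance, and an air-mass inductance from the pressure drops described above: the inlet loss of 1.37 times the Bernoulli pressure, the viscous loss in the rectangular slit, and the outlet recovery of 2N(1 - N) times the Bernoulli pressure. The function name, argument names, and the sample dimensions in the usage note are assumptions made for illustration; the viscous term is taken in the rectangular-slit form with the glottal length squared, and the air constants are those listed in Table 2-1 (CGS units).

```python
RHO = 1.14e-3   # density of air, g/cm^3 (Table 2-1)
MU = 1.86e-4    # viscosity of air, dyne*sec/cm^2 (Table 2-1)


def glottal_elements(a_g, l_g, d, u_g, a_1):
    """Return (r_v, r_k, l_gl) for a rectangular glottal slit.

    a_g : glottal area (cm^2), l_g : glottal length (cm), d : depth (cm),
    u_g : glottal volume velocity (cm^3/sec), a_1 : tract input area (cm^2).
    """
    if a_g <= 0.0 or a_1 <= 0.0:
        raise ValueError("areas must be positive")
    # Viscous resistance of the slit (assumed form: 12 * mu * l_g^2 * d / A_g^3).
    r_v = 12.0 * MU * l_g ** 2 * d / a_g ** 3
    # Kinetic resistance: inlet loss minus outlet pressure recovery,
    # both expressed through the Bernoulli pressure rho * (u_g / a_g)^2 / 2.
    n = a_g / a_1
    r_k = RHO * abs(u_g) / (2.0 * a_g ** 2) * (1.37 - 2.0 * n * (1.0 - n))
    # Inductance of the air mass inside the slit.
    l_gl = RHO * d / a_g
    return r_v, r_k, l_gl
```

For example, `glottal_elements(0.1, 1.4, 0.3, 500.0, 3.0)` uses hypothetical dimensions of roughly the right order. Since 2N(1 - N) never exceeds 0.5, the kinetic resistance stays non-negative for any flow.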

The  advantages  of  the  proposed  glottal  source  model  are:  (1)  the  reduction  of 
the  imposed  computational  burden  through  the  elimination  of  the  nonlinear 
mechanical properties of the vocal folds; (2) the more accurate user-controlled pitch
contour;  and  (3)  the  possibility  of  using  the  projected  glottal  area  function  (obtained 
from  measurement)  as  a control  parameter. 


The  Vocal  Tract  Model 

The  vocal  tract  is  a three-dimensional  lossy  cavity  composed  of  non-uniform 
cross-sections and non-rigid walls [Sondhi, 1974; 1986]. Although the appropriate
Navier-Stokes equation with the boundary conditions of the non-rigid walls
describes the acoustic properties of the vocal tract, neither the shape of the vocal
tract nor the physical properties of the wall are known accurately enough to set up
such a model, and solving such equations would require a large number of
calculations. These facts suggest the need for simplification in the acoustic model of
the vocal tract.

Figure 2-7. The equivalent circuit of the glottal orifice.

The vocal tract model under consideration, which consists of pharyngeal, oral, and
nasal cavities, is shown schematically in Figure 2-8. As previously
mentioned,  the  wave  motion  in  the  vocal  tract  can  be  approximated  as  a plane  wave. 
Therefore,  only  the  cross-sectional  area  and  the  perimeter  along  the  length  of  the 
vocal  tract  determine  the  acoustic  characteristics  of  the  vocal  tract.  Thus,  the 
acoustic  equations  can  be  described  in  one  dimension  instead  of  three,  a significant 
simplification.  The  area  function  of  the  vocal  tract  is  then  approximated  by  a 
sufficiently  small  number  of  successive  sections  with  each  section  having  a constant 
cross-sectional  area. 

For each section of the vocal tract, the acoustic model is derived as follows.
Portnoff [1973] has shown that sound waves in the lossless tube satisfy the following
equations:

$$-\frac{\partial p}{\partial x} = \rho \frac{\partial (u/A)}{\partial t}$$

$$-\frac{\partial u}{\partial x} = \frac{1}{\rho c^2} \frac{\partial (pA)}{\partial t} + \frac{\partial A}{\partial t}$$

where $p = p(x,t)$ is the sound pressure, $u = u(x,t)$ is the volume velocity, $\rho$ is the
density of air, $c$ is the velocity of sound, and $A = A(x,t)$ is the area function of the
tube. Applying these two equations to the section specified by the constant
cross-sectional area $A$ and the length $h$ yields

$$-\frac{\partial p}{\partial x} = \frac{\rho}{A} \frac{\partial u}{\partial t}$$

$$-\frac{\partial u}{\partial x} = \frac{A}{\rho c^2} \frac{\partial p}{\partial t}$$

Based on the similarity between these equations and the equations of lossless,
uniform electrical transmission lines, the tube can be represented by an inductance,
$L = \rho h / A$, followed by a shunt capacitance, $C = hA / (\rho c^2)$.

Figure 2-8. A schematic diagram of the vocal tract and its area function.
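A short Python sketch may help make the transmission-line analogy concrete. It computes the series inductance and shunt capacitance of one uniform section from the expressions above, using the air density and sound velocity listed in Table 2-1. The function name and the example dimensions in the usage note are assumptions made for illustration.

```python
import math

RHO = 1.14e-3      # density of air, g/cm^3 (Table 2-1)
C_SOUND = 35300.0  # velocity of sound, cm/sec (Table 2-1)


def section_lc(area, length):
    """Return (L, C) of a lossless uniform section.

    area   : cross-sectional area A (cm^2)
    length : section length h (cm)
    """
    if area <= 0.0 or length <= 0.0:
        raise ValueError("area and length must be positive")
    inductance = RHO * length / area                     # L = rho * h / A
    capacitance = length * area / (RHO * C_SOUND ** 2)   # C = h * A / (rho * c^2)
    return inductance, capacitance
```

Two quick consistency checks follow from these definitions: `math.sqrt(L / C)` equals the characteristic impedance rho*c/A of the section, and `math.sqrt(L * C)` equals the one-way propagation time h/c. For a hypothetical 0.5 cm section of area 2 cm^2, the latter is about 14 microseconds.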

Now  the  effects  of  the  vibration  of  the  vocal  tract  wall  need  to  be  added  to  the 
above  model.  The  variations  of  the  air  pressure  inside  the  tract  will  cause  the  walls  to 
experience  a varying  force.  Since  the  walls  are  elastic,  the  cross-sectional  area  of  the 
tube  will  change  depending  upon  the  pressure  in  the  tube.  Assuming  that  the  walls 
are  subject  to  local  reactions  (i.e.  the  motion  of  one  portion  of  the  wall  is  dependent 
only  upon  the  acoustic  pressure  on  that  portion  and  independent  of  the  motion  of  any 
other  part  of  the  wall),  the  area  A(x,t)  will  be  a function  of  the  pressure  p(x,t).  Since 
the  pressure  variations  are  very  small,  the  resulting  variation  in  the  cross-sectional 
area  can  be  treated  as  a small  perturbation, 

$$A = A_0 + \Delta A = A_0 + y S_0$$

where $A_0$ is the nominal area, $\Delta A$ is a small perturbation, $S_0$ is the perimeter of the
tube, and $y$ is the displacement of the yielding walls due to the sound pressure inside
the tube. Let $m$, $b$, and $k$ represent the mass, the mechanical resistance, and the
stiffness of the wall per unit length of the tube, respectively. According to Newton's
law

$$m \frac{d^2 y}{dt^2} + b \frac{dy}{dt} + k y = p S_0$$

Since the airflow generated by the wall motion in the unit length is defined by

$$u_w = S_0 \frac{dy}{dt}$$

these two equations can be combined to obtain the resultant equation:

$$p = \frac{m}{S_0^2} \frac{du_w}{dt} + \frac{b}{S_0^2} u_w + \frac{k}{S_0^2} \int u_w \, dt$$

The equivalent circuit of this equation is a series RLC circuit, where $L_w = m/S_0^2$,
$C_w = S_0^2/k$, and $R_w = b/S_0^2$.

The  effects  of  viscous  friction  and  thermal  conduction  at  the  wall  sites  are 
much  less  pronounced  than  those  of  the  wall  vibration.  Flanagan  [1972b]  considered 
these  losses  in  detail  and  showed  that  the  effects  of  viscous  friction  can  be  accounted 
for  by  including  a frequency-dependent  resistor,  R,  in  series  with  the  inductor,  L. 
The  effects  of  heat  conduction  through  the  vocal  tract  wall  can  be  accounted  for  by 
adding  a frequency-dependent  resistor,  1/G,  in  parallel  with  the  capacitor,  C.  The 
resistor, R, is significant in time domain simulation; when a constriction occurs, the
resistance becomes very large and the air flow is blocked.

As a result, a section of the vocal tract may be represented by a finite number of
transmission line elements whose structure is given in Figure 2-9. The definitions of
the circuit components are given in Table 2-1. The vocal tract model is then built up
by  concatenating  these  element  models.  No  standardized  model  of  the  vocal  tract 
has  been  established  to  date  and  the  choices  for  the  component  values  vary 
inconsistently  among  researchers.  The  values  representing  the  best  choices  remain 
to  be  determined  [Wakita  and  Fant,  1978]. 

When  there  is  rapid  flow  of  air  through  a constriction  or  past  an  obstruction  in 
the  vocal  tract,  turbulent  eddies  are  created  in  the  flow.  These  random  fluctuations 
in  the  flow  act  as  a source  of  sound.  The  turbulent  noise  is  present  in  fricative 
consonants  and  immediately  following  the  release  of  plosive  consonants.  The  noise 
source can be modeled as a sound pressure source, $P_n$, and its inherent resistance, $R_n$
[Flanagan et al., 1975]. The intensity of the random pressure is proportional to the
square of the Reynolds number (Re) in excess of some critical value. The squared
Reynolds number is

Figure 2-9. A typical network representation of an element in the
transmission line analog of the vocal tract.

Table 2-1. The physical definition of the circuit components of the vocal
tract.

$R$ ; Series Resistance
$L$ ; Series Inductance
$C$ ; Shunt Capacitance
$G$ ; Shunt Conductance
$R_w$ ; Resistance in Wall Impedance
$L_w$ ; Inductance in Wall Impedance
$C_w$ ; Capacitance in Wall Impedance

where 


$l$ : length of element
$\rho$ : density of air ($1.14 \times 10^{-3}$ g cm$^{-3}$)
$c$ : sound velocity (35300 cm/sec)
$\mu$ : viscosity ($1.86 \times 10^{-4}$ dyne sec cm$^{-2}$)
$\eta$ : adiabatic gas constant (1.4)
$\lambda$ : coefficient of heat conduction of air ($5.5 \times 10^{-5}$ cal cm$^{-1}$ sec$^{-1}$ deg$^{-1}$)
$\xi$ : specific heat ($2.4 \times 10^{-1}$ cal g$^{-1}$ deg$^{-1}$)
$A$ : cross-sectional area of element
$S$ : circumference of element
$\omega$ : radian frequency

$b$ : mechanical resistance of wall per unit length
$m$ : mass of wall per unit length
$k$ : stiffness of wall per unit length

where $\bar{U}$ is a digitally low-pass filtered version of the volume velocity $U$ at the
constriction:

$$\bar{U}(n) = \bar{U}(n-1) + [U(n) - \bar{U}(n-1)] \, 2 \pi f_g T .$$

The value of the cutoff frequency $f_g$ is not critical. Flanagan et al. [1975] used 500 Hz in
order to ensure stability. The source resistance $R_n$ is


The  noise  source  appears  in  the  vocal  tract  network  as  sketched  in  Figure  2-10(a). 
Here, the impedances $Z_1$ and $Z_2$ are the input impedances seen toward the glottis and
lips at the constriction, respectively. Sondhi and Schroeter [1987] reported that this
model did not give satisfactory results in synthesizing unvoiced sounds. They found
that even with $R_n = 0$, the volume velocity $U_m$ was not large enough, because $Z_1$ was
much too high. They proposed a new noise source which is a short-circuit noise flow
$U_n = P_n / R_n$ in parallel with the vocal tract network (Figure 2-10(b)). The
current source is placed one section downstream of the outlet of the narrowest
constriction (except when the constriction is at the lips). The modified turbulent noise
model  was  used  in  the  proposed  articulatory  synthesizer. 
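The low-pass filtering of the constriction volume velocity used in the noise source can be sketched as a one-pole recursive filter. The Python below follows the recursion given earlier in this section, in which each output sample moves toward the current input by the fraction 2*pi*f_g*T. The function name, the zero initial state, and the 20 kHz sampling rate in the usage note are assumptions made for illustration only.

```python
import math


def lowpass_volume_velocity(u, f_g, T, u_bar0=0.0):
    """One-pole low-pass filter applied to a sequence of volume velocities.

    u      : iterable of volume velocity samples U(n)
    f_g    : cutoff frequency in Hz
    T      : sampling interval in seconds
    u_bar0 : assumed initial filter state
    """
    alpha = 2.0 * math.pi * f_g * T
    if not 0.0 < alpha < 2.0:
        # Outside this range the recursion itself no longer decays.
        raise ValueError("2*pi*f_g*T must lie in (0, 2)")
    filtered = []
    prev = u_bar0
    for sample in u:
        prev = prev + (sample - prev) * alpha
        filtered.append(prev)
    return filtered
```

With the 500 Hz cutoff mentioned in the text and a hypothetical 20 kHz sampling rate, `lowpass_volume_velocity([1.0] * 200, 500.0, 1.0 / 20000.0)` rises from about 0.157 on the first sample toward 1.0, which shows the smoothing effect on a step in flow.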

The  Radiation  Model 

The  lip  radiation  impedance  is  generally  approximated  by  the  radiation 
impedance  of  a vibrating  piston  set  in  an  infinite  baffle,  and  is  represented  by 

$$Z_r = \frac{\rho c}{A} \left( 1 - \frac{J_1(2ka)}{ka} + j \frac{S_1(2ka)}{ka} \right)$$

where $k = \omega/c$, $a$ is the piston radius, $A$ is the piston area, $J_1(x)$ is the first-order Bessel
function of the first kind, and $S_1(x)$ is the first-order Struve function [Rayleigh, 1945].
However, a more representative model is that of a vibrating piston set in a sphere.
The radiation impedance for this case has been modeled by Morse and Ingard [1968]
as

$$Z_r = \frac{\rho c}{A} \left( \frac{(ka)^2 K(ka)}{2} + j \frac{8 (ka) S(ka)}{3 \pi} \right)$$

where $K(ka)$ and $S(ka)$ are complicated functions which indicate the deviation from
the case of a piston in an infinite baffle. These two models are difficult to implement
in a time-domain simulation. Fortunately, Flanagan [1972b] has suggested the
parallel circuit approximation where both the conductance, $G_r$, and the susceptance,
$S_r$, are independent of frequency. The values of $G_r$ and $S_r$ are given by

$$G_r = \frac{9 \pi^2 A}{128 \rho c}$$

and

$$S_r = \frac{3 \pi \sqrt{\pi A}}{8 \rho}$$

Figure 2-10. Noise source models. (a) Serial noise source model. (b) Parallel noise source model.

The  radiation  impedance  loads  the  end  of  the  vocal  tract  line:  the  resistive  part 
simulating  the  consumption  of  the  radiated  energy  and  the  reactance  representing  the 
effective  mass  of  vibrating  air  at  the  lips.  The  parallel  circuit  approximation  method 
was  used  in  the  proposed  synthesizer. 
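The parallel circuit approximation can be checked numerically with a small Python sketch. It assumes that the conductance is 9*pi^2*A/(128*rho*c) and that the frequency-independent susceptance term is 3*pi*sqrt(pi*A)/(8*rho), interpreted here as the reciprocal of the radiation inductance so that the inductive branch has admittance S_r/(j*omega). That interpretation, the function names, and the example opening in the usage note are assumptions; the air constants are those of Table 2-1.

```python
import math

RHO = 1.14e-3      # density of air, g/cm^3 (Table 2-1)
C_SOUND = 35300.0  # velocity of sound, cm/sec (Table 2-1)


def radiation_elements(area):
    """Return (G_r, S_r) for a mouth or nostril opening of the given area (cm^2)."""
    if area <= 0.0:
        raise ValueError("area must be positive")
    g_r = 9.0 * math.pi ** 2 * area / (128.0 * RHO * C_SOUND)
    s_r = 3.0 * math.pi * math.sqrt(math.pi * area) / (8.0 * RHO)
    return g_r, s_r


def radiation_impedance(area, freq):
    """Impedance of the parallel resistor-inductor load at frequency freq (Hz)."""
    g_r, s_r = radiation_elements(area)
    omega = 2.0 * math.pi * freq
    # Admittances of parallel branches add: G_r + S_r / (j * omega).
    admittance = complex(g_r, -s_r / omega)
    return 1.0 / admittance
```

At low frequencies the result should approach the leading terms of the piston-in-baffle impedance, (rho*c/A) * [(ka)^2/2 + j*8ka/(3*pi)]. For a hypothetical 1 cm^2 opening at 100 Hz the two agree to well within 0.1 percent, which is one way to see why a frequency-independent pair of elements is adequate in the speech band.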

Since  the  human  ear  is  sensitive  to  fluctuations  in  sound  pressure,  the  sound 
pressure  at  a distance  from  the  lips  is  the  final  variable  of  concern.  The  sound 
pressure  P(t)  at  distance  d from  the  lips  is  linearly  related  to  the  volume  velocity  U(t). 
The  precise  form  of  the  relation  depends  on  the  shape  of  the  mouth  opening  and  of 
the  head  of  the  speaker.  Assuming  the  radiation  is  uniform  in  all  directions,  the 
sound  pressure  produced  at  distance  d from  the  lips  is  given  by 


$$P(t) = \frac{\rho}{4 \pi d} \frac{dU(t - d/c)}{dt}$$

where $c$ is the velocity of sound and $\rho$ is the density of the air. A more exact formula
is given by Fant [1960]

$$\frac{P}{U} = \frac{\rho \omega K_T(\omega)}{4 \pi d} .$$

The factor $K_T(\omega)$ is a smooth high-frequency emphasis of about 1.5 dB per octave
from 312 Hz to 5000 Hz. It represents two effects, the baffle effect and the effect of
increase in radiation resistance in excess of the frequency proportionality. Because
of the lack of experimental verification, $K_T(\omega)$ is generally eliminated from the
calculations. The differentiation method was used in the proposed synthesizer for its
simplicity.
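A minimal Python sketch of the differentiation method is given below. It assumes that the time derivative of the volume velocity is approximated by a backward difference and that the propagation delay d/c is omitted, since it only shifts the waveform in time. The function name, the zero first sample, and the numbers in the usage note are illustrative assumptions.

```python
import math

RHO = 1.14e-3  # density of air, g/cm^3 (Table 2-1)


def radiated_pressure(u, T, d):
    """Sound pressure at distance d (cm) from sampled volume velocity u.

    u : list of volume velocity samples (cm^3/sec)
    T : sampling interval (sec)
    d : distance from the lips (cm)
    The first output sample is set to zero because no earlier sample
    is available for the backward difference.
    """
    scale = RHO / (4.0 * math.pi * d)
    pressure = [0.0]
    for n in range(1, len(u)):
        pressure.append(scale * (u[n] - u[n - 1]) / T)
    return pressure
```

For a volume velocity that rises linearly at 100 cm^3/sec per second, every sample after the first equals rho*100/(4*pi*d). In a full synthesizer, `u` would be the sum of the mouth and nostril volume velocities.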

The  lumped-element  network  representation  for  the  vocal  system  is  shown  in 
Figure  2-11.  The  sum  of  the  acoustic  volume  velocities  radiated  from  the  mouth  and 
from  the  nostril  is  used  to  calculate  the  sound  pressure. 


Numerical  Method  for  Solving  System  Equations 
After formulating the mathematical model of the vocal system and deciding
the parameters of the model, the next step is to solve the ordinary
differential equations with initial conditions. Since present analytical techniques are
not  powerful  enough  to  solve  these  equations,  numerical  techniques  are  used  to 
obtain  approximate  solutions. 

There  are  many  methods  proposed  for  solving  ordinary  differential  equations. 
These  methods  can  be  divided  into  two  categories:  methods  for  solving  non-stiff 
equations  and  methods  for  solving  stiff  equations  [Gupta  et  al.,  1985].  Stiff 
equations are characterized by their Jacobian having widely separated eigenvalues,
some of which have negative real parts of large modulus [Gear,
1971]. A fundamental numerical problem which severely limits the usefulness of
many computer simulation programs is the stiff equation problem. Since the
acoustic equations of the vocal system are stiff equations (discussed later in the text),
this section will restrict the discussion to stiff equations. In order to discuss the stiff
equation problem, the error behavior of numerical methods must be considered first.

Figure 2-11. The lumped-element network representation for the vocal
system.


Error  Behavior  of  Numerical  Methods 
Let 

$$y' = f(t, y), \quad y(t_0) = y_0$$

be the differential equation being solved. Assume the solution values $y_1, y_2, \ldots, y_n$ at
times $T, 2T, \ldots, nT$, respectively, have been computed. To compute $y_{n+1}$, the
numerical methods attempt to solve the following problem

$$y' = f(t, y), \quad y(nT) = y_n .$$

The numerical methods are therefore trying to approximate the curve on which
$(nT, y_n)$ lies, not the curve on which $(t_0, y_0)$ lies. Let the solution curve on which
$(nT, y_n)$ lies be $u_n(t)$ and the original solution curve be $y(t)$. The local error in the
computed solution, $y_{n+1}$, is now given by

$$e_{n+1} = u_n((n+1)T) - y_{n+1} .$$

This is the error made by the numerical method in one step. On the other hand, the
global error at any point is the total error in the computed solution at that point. It
shows how far the computed solution is from the original solution curve
(Figure 2-12). The global error at $(n+1)T$ is

$$g_{n+1} = y((n+1)T) - y_{n+1} .$$

Thus,  in  general,  the  global  error  can  be  interpreted  as  the  actual  error  accrued  from 
all  previous  steps. 

Since the local error at $t = (n+1)T$ may be either positive or negative, the global
error accounting for the accumulation of local errors may or may not grow with time.
It is conceivable that after a few steps in time, the local errors may partially cancel
each other due to variations in their sign. A numerical integration algorithm
whose global error is not amplified but actually
decreases with time is considered to be numerically stable. Algorithms that do not
possess this property are said to be numerically unstable [Gear, 1971]. Clearly, even
if the local error is small, the global error of an unstable algorithm will eventually
become large enough to make the resulting solution useless.

Figure 2-12. The illustration of local and global errors.
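The distinction between local and global error can be illustrated with a small Python experiment. The sketch assumes the forward Euler method applied to the scalar problem y' = lam*y starting at t = 0, for which both the curve through the last computed point and the original solution curve are known in closed form. The function and variable names are hypothetical.

```python
import math


def euler_errors(lam, T, steps, y0=1.0):
    """Return lists of local and global errors for forward Euler on y' = lam*y."""
    local_errors = []
    global_errors = []
    y = y0
    for n in range(steps):
        y_next = y + T * lam * y                          # forward Euler step
        u_next = y * math.exp(lam * T)                    # exact curve through (nT, y_n)
        true_next = y0 * math.exp(lam * (n + 1) * T)      # original solution curve
        local_errors.append(u_next - y_next)
        global_errors.append(true_next - y_next)
        y = y_next
    return local_errors, global_errors
```

With `lam = -1.0` and `T = 0.1`, every local error is positive, yet the global error first grows to roughly 0.02 and then shrinks again, which is the behavior of a numerically stable computation in the sense described above.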


Stability  of  Numerical  Methods 

To  study  the  stability  of  a formula,  one  must  first  analyze  its  performance  on 
the  following  test  problem: 

$$y' = \lambda y, \quad y(t_0) = y_0 .$$

The analytical solution of the test equation is

$$y(t) = y_0 e^{\lambda (t - t_0)}$$

so that for real $\lambda$, the solution grows exponentially if $\lambda > 0$, but decays exponentially
for $\lambda < 0$. This test problem has been traditionally used for stability analysis, since
the analytic expressions describing the solution produced by the numerical method
can be readily obtained. Studying the behavior of a numerical method in solving this
problem is useful in predicting its behavior in solving other problems, since the
equation

$$y' = f(t, y)$$

may be approximated by

$$y' = \frac{\partial f}{\partial y} (y - y_0) + F(t_0, y_0) .$$

Over a small time interval $(t_0, t_0 + T)$, $\partial f / \partial y$ may be approximated by $\lambda$ such that the
above equation may then be rewritten as

$$y' = \lambda (y - y_0) + F(t_0, y_0) .$$

The term $F(t_0, y_0)$ rarely affects stability. Thus, the test problem serves as a good
model for studying the general case, $y' = f(t, y)$, over a small interval $[t_0, t_0 + T]$. If
presented with a set of equations, then

$$y' = J (y - y_0) + F(t_0, y_0)$$

where $J$ is the Jacobian of the function $f$. The parameter $\lambda$ is then an eigenvalue of $J$,
which may be a complex number.

The behavior of the solution computed by the forward Euler method for the
test problem clearly illustrates the instability problem. For calculating $y_{n+1}$, the
formula is

$$y_{n+1} = y_n + T f(nT, y_n) = y_n + \lambda T y_n .$$

The ratio of the computed solutions at $(n+1)T$ and $nT$ is given by

$$\frac{y_{n+1}}{y_n} = 1 + \lambda T ,$$

and the ratio of the true solutions at $(n+1)T$ and $nT$ is

$$\frac{y((n+1)T)}{y(nT)} = \frac{e^{\lambda (n+1) T}}{e^{\lambda n T}} = e^{\lambda T} .$$

Since $1 + \lambda T$ is a reasonable approximation to $e^{\lambda T}$ except when $\lambda T < -2$ (provided $\lambda$ is
real), the numerical solution is a good approximation of the real solution when
$\lambda T > -2$.

But for large negative $\lambda T$, $e^{\lambda T}$ is much smaller than 1, while $|1 + \lambda T|$ is greater
than 1. This means that the magnitude of the numerical solution grows while the true
solution decays; and therefore, the numerical solution produced by the Euler method
for $\lambda T < -2$ is unstable.
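
The growth described above can be checked numerically. The following is a minimal sketch, not part of the synthesizer; the eigenvalue, the step sizes, and the function name `forward_euler` are illustrative assumptions. It applies the forward Euler recursion to the test problem on either side of the limit $\lambda T = -2$.

```python
import math

def forward_euler(lam, T, n_steps, y0=1.0):
    """Integrate y' = lam*y with forward Euler: y_{n+1} = y_n + T*lam*y_n."""
    y = y0
    for _ in range(n_steps):
        y = y + T * lam * y
    return y

if __name__ == "__main__":
    lam = -100.0                      # real, negative eigenvalue (assumed value)
    n_steps = 200
    for T in (0.001, 0.019, 0.021):   # lam*T = -0.1, -1.9, -2.1
        y_num = forward_euler(lam, T, n_steps)
        y_true = math.exp(lam * T * n_steps)
        print(f"lam*T = {lam * T:5.2f}  computed = {y_num: .3e}  true = {y_true: .3e}")
```

For the two smaller step sizes the computed magnitude decays, while for $\lambda T = -2.1$ it grows even though the true solution decays.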

A formula is designated as stable if the computed solution for $\lambda T = 0$ does not
grow without bound, while a formula which produces decaying solutions for all $\lambda T$
values with $\mathrm{Re}(\lambda T) < 0$ is termed A-stable. Formulas producing decaying solutions
for all $\lambda T$ values such that $|\arg(-\lambda)| < \alpha$, when $\lambda T \neq 0$, are known as A($\alpha$)-stable
[Gear, 1971].


Stiff Equations

Now assume that some $\lambda_j$ (eigenvalues of $J$) are negative and quite large in
magnitude in comparison with the others. This implies that some components of the
solution will decay very quickly and for all practical purposes these components may
become zero. For an approximate solution, insignificant components do not need to
be computed accurately. Suppose that the Euler method is used to solve such a
problem, where, for example, $\lambda = -10^5$. In order to ensure that the error
corresponding to this eigenvalue does not grow, the step size, $T$, must be less than
$2 \times 10^{-5}$. It is possible to obtain accurate approximations to the other solution
components with step sizes much greater than $2 \times 10^{-5}$. In this case, it is the stability
requirements rather than the accuracy requirements that limit the step size. In
general, stiff equations are characterized by the Jacobian having widely separated
eigenvalues and having some eigenvalues with negative real parts of large
modulus [Gear, 1971].
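
To make the cost of stiffness concrete, the sketch below compares the number of forward Euler steps dictated by stability with the number that accuracy alone would require. It is illustrative only: the two eigenvalues, the simulated interval, the accuracy-driven step size, and the helper names are assumptions, not values taken from the acoustic equations.

```python
import math

def euler_step_limit(eigenvalues):
    """Largest forward Euler step with |1 + lam*T| <= 1 for all real, negative lam."""
    return 2.0 / max(abs(lam) for lam in eigenvalues)

def stiffness_ratio(eigenvalues):
    """Ratio of the largest to the smallest eigenvalue magnitude."""
    magnitudes = [abs(lam) for lam in eigenvalues]
    return max(magnitudes) / min(magnitudes)

if __name__ == "__main__":
    eigenvalues = [-1.0, -1.0e5]   # slow and fast components (assumed values)
    t_end = 5.0                    # simulated interval in seconds (assumed)
    T_accuracy = 1.0e-2            # step assumed adequate for the slow component

    T_stability = euler_step_limit(eigenvalues)
    print("stiffness ratio          :", stiffness_ratio(eigenvalues))
    print("stability-limited step   :", T_stability)
    print("steps needed (stability) :", math.ceil(t_end / T_stability))
    print("steps needed (accuracy)  :", math.ceil(t_end / T_accuracy))
```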

It is important to discern whether the acoustic equations of the vocal system
are stiff or not. Shampine and Gear [1979] proposed several ways of determining if
equations are stiff. First, if a system is known to be very stable, it is likely to be stiff.
Second, if some of the variables exhibit a large rate of change with respect to time
while other variables within the same equations exhibit a slower rate of change, then
the governing equations are likely to be stiff. Third, a common sign of stiffness is
that a method aimed at non-stiff problems proves conspicuously inefficient for no
obvious reason. All three of these criteria indicate that the acoustic equations are
stiff: (1) the vocal system is very stable, as evidenced by the fact that the vocal-tract
response decays to zero very quickly after glottal excitation stops; (2) the volume
velocity at the glottis changes much faster than that at the mouth during the closing
phase; and (3) when an explicit Runge-Kutta method is used to solve the acoustic
equations, it not only appears inefficient (requiring 5 hours of computation for
synthesizing 1 second of speech on the Data General Eclipse S/130 [Bocchieri, 1983]),
but sometimes results in an unstable solution.


Methods  for  Solving  Stiff  Equations 

To  solve  stiff  equations  efficiently,  one  must  select  an  algorithm  that  will 
allow  the  step  size  to  be  varied  over  a wide  range  of  values  and  yet  will  remain 
numerically stable. No explicit method can be A($\alpha$)-stable [Gear, 1971]. Therefore,
only implicit methods can be used to solve stiff equations. The family of
Adams-Moulton algorithms is generally considered to be the best family of
general-purpose algorithms for solving initial-value problems [Chua and Lin,
1975]. In fact, for any given $\lambda < 0$, the first-order Adams-Moulton algorithm
(backward Euler) and the second-order Adams-Moulton algorithm (trapezoidal) will
be stable for any step size [Gear, 1971]. Hence, the choice of step size, $T$, for these
algorithms  is  only  restricted  by  accuracy  and  not  by  stability.  Many  current  network 
simulation  programs  make  use  of  only  the  backward  Euler  or  trapezoidal  algorithms 
[Chua  and  Lin,  1975]. 

Since the acoustic equations have been established as stiff equations, only
implicit methods can be used. According to Dahlquist's theory, a multi-step algorithm
that is absolutely stable in the region $\mathrm{Re}(\lambda T) < 0$ cannot exceed order 2 [Chua and
Lin, 1975], so higher-order algorithms are excluded from consideration. In the
present study, the trapezoidal algorithm was chosen to solve the acoustic equations
since it is more accurate than the backward Euler algorithm. The trapezoidal algorithm
[Gear, 1971] for solving $y' = f(y)$ is

$$y_{n+1} = y_n + \frac{T}{2}\left[ f(y_n) + f(y_{n+1}) \right] .$$
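
For the linear test problem $y' = \lambda y$ the implicit formulas can be solved in closed form, which gives a per-step growth factor of $1/(1 - \lambda T)$ for backward Euler and $(1 + \lambda T/2)/(1 - \lambda T/2)$ for the trapezoidal algorithm. The sketch below uses these factors to illustrate both claims made above: stability for a very large step, and the better accuracy of the trapezoidal algorithm. The numerical values and function names are assumptions chosen for illustration.

```python
import math

def backward_euler_factor(lam, T):
    """Per-step growth factor of backward Euler applied to y' = lam*y."""
    return 1.0 / (1.0 - lam * T)

def trapezoidal_factor(lam, T):
    """Per-step growth factor of the trapezoidal algorithm applied to y' = lam*y."""
    return (1.0 + 0.5 * lam * T) / (1.0 - 0.5 * lam * T)

def integrate(factor, lam, T, n_steps, y0=1.0):
    """Advance y0 by n_steps steps of size T using the given growth factor."""
    return y0 * factor(lam, T) ** n_steps

if __name__ == "__main__":
    # Stability: a step far beyond the forward Euler limit still decays.
    print(abs(backward_euler_factor(-1.0e5, 1.0e-3)),
          abs(trapezoidal_factor(-1.0e5, 1.0e-3)))

    # Accuracy: integrate y' = -y to t = 1 with T = 0.1 and compare with exp(-1).
    exact = math.exp(-1.0)
    print(abs(integrate(backward_euler_factor, -1.0, 0.1, 10) - exact),
          abs(integrate(trapezoidal_factor, -1.0, 0.1, 10) - exact))
```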


Simplification Using Associated Discrete Circuit Model

The  associated  discrete  circuit  model  [Chua  and  Lin,  1975]  is  an  advanced 
computational  technique  used  in  the  computer-aided  analysis  of  electronic  circuits. 
This  model  greatly  simplifies  the  analysis  program  of  dynamic  networks.  From  the 
numerical  integration  point  of  view,  the  differential  equation  characterizing  a 
capacitor  or  an  inductor  can  be  approximated  by  a resistive  circuit  associated  with 
the  integration  algorithm.  After  replacing  each  capacitor  and  each  inductor  by  a 
resistive  discrete  circuit  model  associated  with  a given  integration  algorithm,  the 
transient  analysis  of  a dynamic  network  can  be  transformed  into  a sequence  of  dc 
analyses  for  a resistive  network.  Obviously,  dc  analysis  is  much  easier  than  the 
transient  analysis.  The  discrete  circuit  models  associated  with  the  trapezoidal 
algorithm  are  derived  below. 

The trapezoidal algorithm for solving the first-order differential equation $v' = f(v)$
with a step size, $T$, is given by

$$v_{n+1} = v_n + T\left[ f(v_{n+1}) + f(v_n) \right]/2 = v_n + T v'_{n+1}/2 + T v'_n/2 . \qquad (2\text{-}1)$$

For a linear capacitor, the relationship is

$$v'(t) = i(t)/C ;$$

then,

$$v'_{n+1} = i_{n+1}/C \qquad (2\text{-}2)$$

$$v'_n = i_n/C . \qquad (2\text{-}3)$$

Substituting Equations (2-2) and (2-3) into Equation (2-1), then solving for $i_{n+1}$
yields

$$i_{n+1} = 2 C v_{n+1}/T - \left( 2 C v_n/T + i_n \right) . \qquad (2\text{-}4)$$

Equation (2-4) can be represented by the equivalent linear circuit shown in Figure
2-13. This circuit is therefore the discrete circuit model for a linear capacitor
associated with the trapezoidal algorithm.

Figure 2-13. Discrete circuit model associated with trapezoidal algorithm
for a linear capacitor.

For a linear inductor,

$$i'(t) = v(t)/L$$

and following the previous procedure, one obtains

$$i_{n+1} = T v_{n+1}/(2L) + \left( T v_n/(2L) + i_n \right) . \qquad (2\text{-}5)$$

The discrete equivalent circuit for Equation (2-5) is shown in Figure 2-14, which is
the model for a linear inductor associated with the trapezoidal algorithm.
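
As an illustration of how a companion model turns transient analysis into a sequence of dc analyses, the sketch below charges a capacitor through a resistor from a step source. The RC circuit, the component values, and the function name `simulate_rc` are assumptions made for this example and are not part of the synthesizer. At each step the capacitor is replaced by the conductance $2C/T$ in parallel with a history current source, following Equation (2-4), so that only a resistive nodal equation is solved.

```python
import math

def simulate_rc(v_source, R, C, T, n_steps):
    """Capacitor voltage after n_steps, using the trapezoidal companion model."""
    g_eq = 2.0 * C / T        # equivalent conductance of the capacitor model
    v = 0.0                   # capacitor initially discharged
    i = v_source / R          # capacitor current just after the step is applied
    for _ in range(n_steps):
        i_hist = g_eq * v + i             # history term (2*C*v_n/T + i_n)
        # Nodal (dc) equation: (v_source - v_new)/R = g_eq*v_new - i_hist
        v_new = (v_source / R + i_hist) / (1.0 / R + g_eq)
        i = g_eq * v_new - i_hist         # Equation (2-4)
        v = v_new
    return v

if __name__ == "__main__":
    R, C = 1.0e3, 1.0e-6      # time constant RC = 1 ms (assumed values)
    T = 1.0e-5                # step size: one hundredth of the time constant
    v_num = simulate_rc(1.0, R, C, T, 100)
    v_exact = 1.0 - math.exp(-1.0)
    print(f"companion model: {v_num:.6f}   analytic: {v_exact:.6f}")
```

An inductor would be handled in the same way, with the conductance $T/(2L)$ and the history term taken from Equation (2-5).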

In  order  to  simplify  the  acoustic  equations,  the  equivalent  circuits  of  the 
articulatory  synthesizer  are  first  transformed  into  a resistive  network  by  using  the 
associated  discrete  circuit  models  (Figure  2-15).  Then,  a set  of  linear  algebraic 
equations  which  describes  the  resistive  network  is  set  up  by  using  circuit  analysis 
techniques. Finally, the resulting set of linear algebraic equations, which has a sparse
coefficient matrix, is sufficient to describe the acoustic properties of the vocal tract.


Practical  Realization 

Two  programs,  ART_MOD  and  ART_SYN,  are  used  to  implement  the 
articulatory  speech  synthesizer.  Program  ART_MOD  is  an  “interactive  graphics 
editor”  which  is  implemented  on  a Tektronix  4113  graphics  terminal  interfaced  with 
a VAX-11/750  minicomputer.  Using  this  graphics  editor,  a user  can  alter  the 
configuration  of  the  articulatory  model  by  means  of  keyboard  strokes  and  thumb 
wheels.  In  this  way,  the  desired  articulatory  configuration  can  be  easily  specified  by 
the  user.  Commands  that  define  the  position  of  the  vocal  organs  of  the  articulatory 
model  are 

CRT: to change the tongue radius,
CTT: to change the center of tongue tip,
CTN: to change the center of tongue body,
CVL: to change the velum,
CLP: to change the lips,
CJW: to change the jaw,
MOD: to modify the “fixed structure” (such as the hard palate, etc.).

Figure 2-14. Discrete circuit model associated with trapezoidal algorithm
for a linear inductor.

Figure 2-15. Simplifying the equivalent circuits of an element of the
vocal tract by using associated discrete circuit models and circuit
analysis techniques.

The  system  also  allows  a user  to  store  (STO  command)  the  displayed  vocal  tract 
shape  in  a disk  file  as  a data  base  which  can  be  later  read  (REA  command)  from  that 
data  base  for  editing.  Since  speech  is  a dynamic  process  in  which  the  vocal  tract 
shapes  change  with  time,  a SWT  command  is  included  to  attach  a time  label  to  the 
displayed  vocal  tract  configuration.  Frames  defined  by  the  SWT  command  are  the 
key  frames  that  must  be  reached  at  the  indicated  time.  Command  AFI  generates  a 
file  which  contains  the  information  of  the  area  function  (in  sixty  sections)  used  by  the 
ART_SYN program. To achieve a continuous and smooth variation of the area
function, the AFI command linearly interpolates the area function between
consecutive key frames.
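
The interpolation between key frames can be sketched as follows. This is not the ART_MOD code: the data layout (a time-sorted list of `(time, areas)` pairs), the choice to hold the first and last frames outside the labelled interval, and the function name are assumptions made for illustration.

```python
from bisect import bisect_right

def interpolate_area(key_frames, t):
    """Linearly interpolate an area function between consecutive key frames.

    key_frames: list of (time, areas) pairs sorted by time, where areas is a
    list with one cross-sectional area per section (same length in every frame).
    """
    times = [time for time, _ in key_frames]
    if t <= times[0]:
        return list(key_frames[0][1])     # before the first key frame
    if t >= times[-1]:
        return list(key_frames[-1][1])    # after the last key frame
    k = bisect_right(times, t)            # index of the first frame later than t
    (t0, a0), (t1, a1) = key_frames[k - 1], key_frames[k]
    w = (t - t0) / (t1 - t0)              # 0 at frame k-1, 1 at frame k
    return [(1.0 - w) * x0 + w * x1 for x0, x1 in zip(a0, a1)]

if __name__ == "__main__":
    frames = [(0.0, [1.0, 2.0, 4.0]), (0.1, [3.0, 2.0, 0.0])]   # assumed data
    print(interpolate_area(frames, 0.05))
```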

The  main  program,  ART_SYN,  lets  a user  specify  the  input  and  output 
filenames,  the  number  of  sections  in  the  acoustic  transmission  line,  and  the  step  size 
of  the  trapezoidal  method.  The  two  input  files  comprise  the  area  function  file  and  the 
glottal  source  file,  while  the  two  output  files  comprise  a speech  file  and  a file  which 
contains  volume  velocities  and  pressures  at  the  glottis,  mouth,  and  nostrils. 
ART_SYN  “calls”  four  subprograms  — NASAL,  TRACT,  FOLDS,  and  ACOUST. 
NASAL  is  called  once  to  specify  a fixed  nasal  tract  shape.  The  user  can  specify  the 
nasal  tract  area  function  in  a file  named  ’NASAL.PRM’;  if  this  file  does  not  exist, 
NASAL  uses  the  default  values.  FOLDS  is  called  for  each  sample  interval;  it  reads 
the  data  from  a glottal  source  file  and  linearly  interpolates  that  data  to  give  the  glottal 
area  and  subglottal  pressure  at  the  specified  time.  TRACT  reads  the  data  from  a file 
containing  the  area  functions  of  the  vocal  tract  (in  sixty  sections),  reduces  the  area 
functions  to  the  number  specified  by  the  user,  and  interpolates  them  linearly.  In 
addition, when TRACT is called the first time, it looks for the files named
'LENSCAL.PRM' and 'ARESCAL.PRM', in which the user can opt to specify length
and area scale factors for every section. Subprogram ACOUST constructs the
acoustic equations and calls the subprogram SOLVE to solve them. When ACOUST
is called the first time, it looks for the following parameter files: (1)
'GLOTTIS.PRM', which contains the specifications regarding the length and the
thickness of the vocal folds; (2) 'NOISE.PRM', which specifies the critical Reynolds
number and the gain of the glottal noise source; (3) 'YIELDWALL.PRM', which
specifies the resistance, mass, and stiffness per unit area of the vocal tract wall; and (4)
'SINUS.PRM', which specifies the physical parameters of the sinuses in the nasal
tract. By taking advantage of all these parameter files, the user may easily change
the model parameters. A block diagram of program ART_SYN is shown in Figure
2-16.

Three main articulatory synthesizers, namely Flanagan's
[Flanagan et al., 1975], Maeda's [1982a], and Bocchieri's [1983], model the
vocal system by a set of differential equations. In comparison with these synthesizers
(see Table 2-2), the proposed articulatory synthesizer has the following features.

(1)  A passive  one-mass  model  was  used  to  simulate  the  vocal  folds.  This 
model  is  simpler  than  the  self-oscillating  model.  The  computational  burden  is 
reduced  through  the  elimination  of  the  nonlinear  mechanical  properties  of  the  vocal 
folds.  This  model  is  controlled  by  the  glottal  area  function,  so  that  the  pitch  contour 
of the synthetic voice is precisely controlled by the user. Maeda simulated the vocal
folds as a slot with a varying area, which did not include the contraction and
expansion at the glottis; consequently, the source-tract interaction cannot be
correctly simulated.

(2) Since the acoustic equations were identified as stiff equations, the
trapezoidal method was used. This numerical method made our articulatory
synthesizer stable.

(3) By using the associated discrete circuit models and circuit analysis
methods, the computational burden was greatly reduced. This made our synthesizer
very efficient.

(4) There are seven parameter files which are read in by the synthesizer
during initialization. The user may change the model parameters, such as the area
function of the nasal tract, by editing these files. This makes the synthesizer flexible
for different research purposes.

Figure 2-16. Block diagram of ART_SYN.

Table 2-2. Comparison of articulatory synthesizers.

                        Flanagan            Maeda               Bocchieri           Ding
----------------------  ------------------  ------------------  ------------------  ------------------
model of vocal folds    self-oscillating    a slot              self-oscillating    passive one-mass
                        two-mass model                          two-mass model      model

noise source at the     No                  No                  Yes                 Yes
glottis

source parameters       Ago, Q, Ps          Ap, f0, Ps          Ago, Q, Ps          Ago, Agamp, Qq,
                                                                                    Qs, f0, Ps

noise source in the     every section       No                  at constriction     at constriction
tract

viscous and thermal     Yes                 No                  Yes                 Yes
losses

tract parameters        A(x)                A(x)                A(x) or position    A(x) or position
                                                                of articulators     of articulators

method of               from equations      from equations      from equations      circuit models
simplification          to equations        to equations        to equations

numerical method        backward Euler      integration by      Runge-Kutta         backward
                                            midpoint rule                           trapezoidal

method of changing      re-compile          re-compile          re-compile          through
model parameters        source code         source code         source code         parameter files


CHAPTER  3 

DERIVING  CONTROL  PARAMETERS 
Introduction 

To  produce  high-quality  synthetic  speech,  a good  synthesizer  is  not  enough.  It 
is  necessary  to  supply  the  synthesizer  with  a sequence  of  control  parameters  which 
are  appropriate  to  the  details  of  the  required  utterance.  In  general,  there  are  two 
methods  for  deriving  the  control  parameters  of  a speech  synthesizer.  One  method 
extracts  control  parameters  from  the  original  speech  by  using  analysis  algorithms 
which  depend  upon  the  specific  synthesis  scheme.  The  other  method  specifies 
parameters  by  applying  rules  to  the  input  phonetic  descriptions.  The  former 
approach  is  used  in  analysis-synthesis  systems  like  vocoders.  Analysis-synthesis 
systems  can  reproduce  speech  with  such  a natural  quality  that  the  identity  of  a 
particular  speaker  can  be  recognized.  The  latter  approach  is  used  in 
synthesis-by-rule  systems  [Klatt,  1987].  The  main  consideration  of  such  systems  is 
intelligibility.  For  an  articulatory  speech  synthesizer,  there  is  no  analysis  algorithm 
that  can  reliably  derive  useful  articulatory  parameters  from  the  speech  signal. 
Therefore,  the  control  parameters  of  an  articulatory  synthesizer  are  derived 
according  to  rules.  The  movements  of  the  vocal  folds  and  the  articulatory  organs 
involved  in  the  production  of  speech  are  so  complex  that  it  is  impossible  to  describe 
them in every detail. However, for applications where the intelligibility of the
synthetic speech is the primary consideration, many of these details are of
minor importance and can be neglected, which simplifies the specification of control
parameters.

The control parameters of the articulatory synthesizer in the present study fall
into two categories: (1) source parameters which involve the subglottal pressure, the
fundamental frequency, and the glottal area wave shape; and (2) tract parameters
which involve positions of the articulators or the cross-sectional area function.
While  the  intelligibility  of  synthetic  speech  is  more  closely  aligned  to  the  movements 
of  the  articulators  than  to  the  exact  movement  characteristics  of  the  vocal  folds,  the 
voice  quality  and  the  personal  characteristics  of  natural  speech  are  more  closely 
related  to  the  source  parameters  [Fant,  1981;  Rosenberg,  1971].  The  main  concern 
of  this  chapter,  then,  is  to  determine  how  one  can  estimate  plausible  values  for  these 
parameters. 


Tract  Parameters 

The  positions  of  the  articulators  or  the  cross-sectional  area  function  of  the 
vocal  tract  are  the  major  parameters  for  the  articulatory  speech  synthesizer.  Several 
approaches  to  obtain  these  parameters  are  reviewed  as  follows. 

X-ray  photography.  In  the  past  the  most  common  method  for  observing  the 
movements  of  the  articulators  was  by  high-speed  cineradiography.  The  study  by 
Perkell [1969] is a well-known representative of this early methodology. He treated a
certain  number  of  systematic  speech  samples,  each  yielding  a monograph.  Although 
this  study  provided  valuable  insights  into  the  characteristics  of  the  articulator 
movements,  this  kind  of  research  is  essentially  limited  by  the  radiation  dosage 
problem  and  by  the  large  amount  of  time  required  for  the  measurement  and  analysis 
of  the  data.  The  X-ray  microbeam  method  [Kiritani,  1986]  which  employs  an  X-ray 
microbeam  generator  with  an  on-line  computer  was  developed  in  an  effort  to  reduce 
the  restrictions  of  cineradiography.  For  the  observation  of  the  articulatory 
movements,  metal  pellets  are  attached  to  the  surface  of  the  articulators  with  the 
deflection  of  the  X-ray  microbeam  controlled  by  the  computer  to  track  the  position 
of the pellets. One drawback to the microbeam system is that it cannot accurately
track the position of the pellets when the subjects have dental fillings or metallic
caps.

The  sagittal  distance  of  the  vocal  tract  can  be  directly  measured  from  the 
X-ray  films,  but  the  cross-sectional  area  function  is  the  desired  parameter.  Several 
methods  have  been  used  for  estimating  the  cross-sectional  area  function  from 
radiographs  of  the  vocal  tract  in  the  lateral  view.  For  the  mouth  cavity,  estimates  of 
the  cross-sectional  areas  can  be  made  from  the  radiographs  in  the  lateral  view  by 
means of plaster casts of the hard palate [Fant, 1960]. For the pharynx, the best
approach is computed tomography [Johansson et al., 1983].

Inverse  transform.  The  most  recent  method  to  derive  the  cross-sectional  area 
function  directly  from  acoustic  measurements  is  by  an  inverse  transform.  In  this 
situation  the  area  function  estimation  methods  can  be  divided  into  three  classes. 

In  the  first  class,  the  area  function  is  reconstructed  from  the  acoustic 
information  measured  by  way  of  an  impedance  tube  placed  against  the  lips  [Sondhi 
and Gopinath, 1971; Sondhi and Resnick, 1983]. This method requires the subject to
articulate speech sounds without phonating, with the lips pressed firmly against the
impedance tube.

The  second  class  uses  an  approach  which  exploits  the  relationship  between 
the  LPC  coefficients  and  the  cross-sectional  areas  of  the  uniform  cylindrical  sections 
of  an  acoustic  tube  [Wakita,  1973;  1979].  This  approach  involves  a simple 
experimental  procedure  and  allows  articulatory  estimates  to  be  made  for  natural 
utterances  obtained  under  normal  speaking  conditions.  Unfortunately,  the  area 
function  estimated  from  the  LPC  coefficients  is  not  guaranteed  to  be  physiologically 
plausible, since the model assumes (1) no source-tract interaction, (2) a lossless
hard-walled tract, and (3) no nasal tract. Additionally, the estimated area function is
not unique and depends upon the assumed boundary conditions [Charpentier, 1984;
Sondhi, 1979]. This drawback limits its use in speech research as a substitute for the
cineradiographic techniques.

The  third  class  involves  an  approach  whose  main  objective  is  to  collect  enough 
acoustic  information  such  that  the  ambiguities  present  in  the  second  approach  can  be 
resolved  [Milenkovic,  1984].  To  apply  this  method  to  human  subjects,  one  must 
solve  two  seemingly  inherent  problems.  First,  the  admittance  values  of  the  radiation 
load  need  to  be  calibrated  for  variations  of  the  field  pattern  with  the  lip  opening. 
Second,  a method  of  measuring  the  acoustic  pressure  inside  the  pharynx  cavity  needs 
to  be  designed. 

Analysis-by-synthesis. Recently, Schroeter et al. [1987] proposed an
analysis-by-synthesis  method  to  derive  articulatory  parameters  from  the  input 
speech  signal.  Their  goal  was  to  re-synthesize  the  original  speech  by  the  hybrid 
time-frequency  domain  articulatory  speech  synthesizer  [Sondhi  and  Schroeter, 
1987].  In  order  to  provide  better  starting  values  for  the  vocal  tract  configurations,  a 
codebook  of  the  tract  shapes  is  searched  exhaustively.  Based  on  this  guide,  the  tract 
shape  is  optimized  by  using  a distance  measure  that  is  sensitive  to  changes  in  the 
tract shape but insensitive to changes in glottal parameters. This procedure is
computationally very demanding.

Trial-and-error.  The  articulatory  synthesizer  can  produce  speech  with  high 
quality,  provided  the  experimenter  can  properly  adjust  the  control  parameters  of  the 
synthesizer  after  the  trial  runs.  A continuous  process  of  feedback  and  adjustment  is 
necessary  until  the  desired  quality  is  achieved.  Fant  [1985]  proposed  a fast 
algorithm  for  estimating  the  log-magnitude  transfer  functions  from  vocal  tract 
configurations.  The  first  estimation  of  the  vocal  tract  configuration  is  from  such 
sources  as  X-ray  photographs,  and  conceptual  ideas  from  phonetics  and  linguistics. 
Then,  the  log-magnitude  transfer  function  is  calculated  by  using  this  algorithm  and 
compared to the target spectrum. If these two spectra do not correspond with each
other, adjustment of the vocal tract configuration is necessary. Since calculating the
log-magnitude transfer function is much faster than synthesizing the sound, the
amount of time required for the adjustment of the control parameters can be greatly
reduced by using this algorithm. The derivation of this algorithm is as follows.
Assuming the complex transfer function of the lossy tract is

$$H = \frac{1}{N_a + j N_b} ,$$

then, the log-magnitude envelope of the transfer function is

$$L(f) = -10 \log_{10}\left( N_a^2 + N_b^2 \right) .$$

To obtain $L(f)$, two steps are needed. First, $N_b$ should be calculated based on
the lossless tract. Then, $N_a$ should be estimated based on $N_b$ with the approximate
average bandwidth function, $B(f)$ (discussed later in this section).

An element of lossless tube with a constant cross-sectional area $A$ and a
length $d$ can be represented by the lumped-element circuit [Badin and Fant, 1984]
shown in Figure 3-1. The series and shunt elements are

$$a = Z \tanh(\theta / 2)$$

$$b = Z / \sinh\theta$$

where

$Z = \rho c / A$,
$\theta = j \omega d / c$,
$\rho$ is the density of air,
$c$ is the velocity of sound, and
$\omega$ is the angular frequency of the sound.

The basic input-output equations for the pressure $P$ and the flow $U$ at the terminals of
the element transmission line satisfy the following mathematical relationships:

$$P_i = U_i Z_i \coth\theta_i - U_{i-1} Z_i / \sinh\theta_i \qquad (3\text{-}1)$$

$$P_{i-1} = U_i Z_i / \sinh\theta_i - U_{i-1} Z_i \coth\theta_i \qquad (3\text{-}2)$$
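Figure 3-1. Network module of a very short tube.

Dividing Equation (3-2) by $U_{i-1}$ and denoting $P_{i-1}/U_{i-1}$ as $Z_{\mathrm{in},i-1}$ yields

$$\frac{U_i}{U_{i-1}} = \cosh\theta_i + \frac{Z_{\mathrm{in},i-1}}{Z_i} \sinh\theta_i , \qquad (3\text{-}3)$$

while dividing Equation (3-1) by $U_i$ yields

$$Z_{\mathrm{in},i} = \frac{P_i}{U_i} = Z_i \coth\theta_i - \frac{U_{i-1}}{U_i} \, \frac{Z_i}{\sinh\theta_i} . \qquad (3\text{-}4)$$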
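
The two relations obtained from Equations (3-1) and (3-2) can be applied section by section, starting at the lips. The sketch below is a hedged illustration rather than Fant's program. It assumes the flow ratio $U_i/U_{i-1} = \cosh\theta_i + (Z_{load}/Z_i)\sinh\theta_i$, where $Z_{load}$ is the impedance seen looking toward the lips, and the input impedance $Z_i\coth\theta_i - (U_{i-1}/U_i) Z_i/\sinh\theta_i$. The physical constants, the ideal open end (zero load impedance), and the function name `flow_ratio` are assumptions. The frequency must be positive, and a frequency at which an intermediate flow ratio is exactly zero would need special handling.

```python
import cmath
import math

RHO = 1.14e-3   # density of air in g/cm^3 (assumed value)
C = 3.5e4       # velocity of sound in cm/s (assumed value)

def flow_ratio(areas, lengths, freq, z_load=0.0):
    """Return U_in/U_out of a lossless tube at frequency freq (Hz, > 0).

    areas, lengths: per-section values in cm^2 and cm, ordered from the lips
    (first section) toward the glottis.
    z_load: impedance terminating the lip end; 0 models an ideal open end.
    """
    omega = 2.0 * math.pi * freq
    z_in = complex(z_load)
    ratio = 1.0 + 0.0j
    for area, d in zip(areas, lengths):
        z = RHO * C / area                 # characteristic impedance of the section
        theta = 1j * omega * d / C
        # Flow transfer across the section (from the output-side equation).
        r = cmath.cosh(theta) + (z_in / z) * cmath.sinh(theta)
        # Input impedance seen at the glottal side of the section.
        z_in = z / cmath.tanh(theta) - z / (r * cmath.sinh(theta))
        ratio *= r
    return ratio

if __name__ == "__main__":
    # Uniform 17.5 cm tube in 35 sections; U_in/U_out should follow cos(omega*L/c).
    areas, lengths = [5.0] * 35, [0.5] * 35
    for f in (250.0, 750.0, 1000.0):
        print(f, abs(flow_ratio(areas, lengths, f)))
```

For the uniform tube the magnitude follows $|\cos(\omega L / c)|$, so the transfer function $U_{out}/U_{in}$ peaks near odd multiples of $c/(4L)$ = 500 Hz.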


At a specified frequency, the recursive routine for obtaining the value of the transfer
function is

(1) to specify the radiation load, $Z_{\mathrm{in},0}$, as the right-hand side loading
impedance,

(2) to let $i = 1$ and $U_{in}/U_{out} = 1$,

(3) to calculate the flow transfer $U_i/U_{i-1}$ using Equation (3-3),

(4) to calculate the input impedance $Z_{\mathrm{in},i}$ using Equation (3-4),

(5) to update $U_{in}/U_{out}$ to $(U_{in}/U_{out})_{old} \cdot U_i/U_{i-1}$,

(6) to make $i = i + 1$,

(7) to go to step 3 if $i$ is less than the total number of sections, otherwise set
$N_b = U_{out}/U_{in}$.


Fant [1985] pointed out that the approximate relationship between the
average bandwidth and the frequency is

$$B(f) = 15 (500/f)^2 + 20 \sqrt{f/500} + 2 (f/500)^2$$

and that $N_a(f)$ can be estimated from $N_b(f)$ and $B(f)$.


Fant listed the first five formant frequencies of Russian vowels estimated by using
this algorithm in his paper [Fant, 1985], and they are very close to the corresponding
real data [Fant, 1960]. An example of a transfer function calculated by using this
algorithm is shown in Figure 3-2.
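
The average bandwidth function is simple to evaluate. The short sketch below codes the expression exactly as it is given in the text, with the frequency in Hz; the function name is an assumption. It does not attempt the estimation of $N_a$.

```python
import math

def average_bandwidth(f):
    """Approximate average bandwidth B(f) for a frequency f > 0 in Hz."""
    return (15.0 * (500.0 / f) ** 2
            + 20.0 * math.sqrt(f / 500.0)
            + 2.0 * (f / 500.0) ** 2)

if __name__ == "__main__":
    for f in (250.0, 500.0, 1500.0, 3000.0):
        print(f"{f:7.1f} Hz  ->  B = {average_bandwidth(f):6.1f}")
```

The first term dominates at low frequencies and the last term at high frequencies, so $B(f)$ has a minimum in between.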
Figure 3-2. Example of transfer function calculated from area function.
Top: Vocal tract area function of Id.
Bottom: The corresponding transfer function.

Since the X-ray photographic method needs special equipment, we were
unable to use this approach for our research. The area functions generated by using
inverse transform methods may not match the real vocal tract area functions and
may not be unique, so this method was not useful for this research. There is no
theory that tells us how to manipulate the area function to match a target spectrum.
Consequently, the analysis-by-synthesis method becomes equivalent to the
trial-and-error method. Therefore, the trial-and-error method is used in this
research. With the help of Fant's algorithm, the tedious calculations can be
reduced somewhat.


Source  Parameters 

It  is  well  established  that  the  quality  of  a synthetic  vowel  is  influenced  by  the 
glottal  waveform  [Rosenberg,  1971].  In  order  to  synthesize  high-quality  speech, 
several  aspects  of  the  glottal  source  dynamics  should  be  simulated  by  the  source 
model.  They  are  (1)  the  pitch  and  its  variations,  (2)  the  glottal  pulse  shape  and  its 
variations,  (3)  the  underlying  temporal  patterns  of  vocal  fold  positioning 
(abduction/adduction),  and  (4)  the  relative  intensity  variations  of  speech  sounds 
[Ananthapadmanabha, 1984]. While the first three aspects are related to the glottal area
function, the last aspect is primarily related to the subglottal pressure.

Glottal  Area  Function 

The  glottal  area  function  is  another  important  control  parameter  for  the 
articulatory  synthesizer.  It  controls  the  pitch  contour,  the  glottal  volume  velocity 
waveform,  and  the  temporal  patterns  of  the  vocal  fold  positioning 
(abduction/adduction).  The  pitch  contour  is  related  to  the  intonation  patterns,  while 
the  excitation  waveshape  corresponds  to  the  personal  characteristics  of  natural 
speech. Abduction/adduction is an extremely important activity of the vocal folds
during the production of speech. These actions occur during voice onset (adduction),
voicing  decay  (abduction),  and  in  the  production  of  stops  and  other  consonants 
(abduction).  Thus,  the  glottal  area  function  becomes  a key  parameter  corresponding 
to  the  naturalness  of  the  synthetic  speech.  Several  approaches  to  obtaining  the  glottal 
area  function  follow. 

High-speed  laryngeal  photography.  High-speed  laryngeal  photography 
[Childers  et  al.,  1980]  provides  an  estimate  of  the  glottal  area.  This  technique 
involves  the  placing  of  a mirror  in  the  pharynx  to  direct  a high-intensity  light  beam 
onto  the  vocal  folds  as  a subject  articulates.  The  mirror  also  serves  to  reflect  the 
image  of  the  folds  to  the  lens  of  a high-speed  motion  picture  camera.  A 
frame-by-frame  analysis  of  these  motion  pictures  may  be  performed  to  provide  a 
sampling  of  the  area  of  the  glottis  during  vocal  fold  vibration,  but  this  procedure 
requires  that  the  mouth  of  the  subject  be  kept  open. 

Photoglottography. The method of photoglottography was originally
developed by Sonesson [1960]. In this method, a light source is placed on the
external neck wall slightly below the level of the vocal folds. As a subject produces
voiced sounds, the light transmitted through the glottis is sensed by a
photomultiplier tube placed in the larynx. The photomultiplier tube, which is
connected to a cathode ray oscilloscope, produces a curve corresponding to the
vibration of the vocal folds, which is displayed on the screen [Kitzing and Sonesson, 1974].
Although high-speed films are most commonly used to monitor the details of the
glottal cycle, that technique is difficult and expensive, and it cannot be performed under
natural conditions because of the need to use a laryngeal mirror. In contrast,
photoglottography can be performed more easily and under more natural conditions,
including natural speech.

Model of the glottal area function. The glottal area function generally
depends on the phonetic context, the phonatory mode, the voice intensity, the pitch,
etc. However, these relationships are not well understood. Ananthapadmanabha
and Fant [1982] have proposed a reasonable model of the glottal area function for
the modal voice (Figure 3-3). The mathematical expression of the model is

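$$
A_g(t) =
\begin{cases}
A_{\max}\left[0.5 - 0.5\cos\left(\pi\,\dfrac{t}{T_0}\right)\right], & 0 \le t \le T_0 \\[2ex]
A_{\max}\cos\left(\dfrac{\pi}{2}\,\dfrac{t - T_0}{T_c}\right), & T_0 < t \le T_0 + T_c \\[2ex]
0, & T_0 + T_c < t \le T
\end{cases}
$$

The four parameters of this model are (1) the pitch period $T$, (2) the duration of the
opening phase $T_0$, (3) the duration of the closing phase $T_c$, and (4) the maximum area
$A_{\max}$.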

The model of the glottal area function was modified to also include breathy
voice. The mathematical expression of the proposed model of the glottal area
function is

$$
A_g(t) =
\begin{cases}
A_{\mathrm{gamp}}\left[0.5 - 0.5\cos\left(\pi\,\dfrac{t}{\frac{Q_o Q_s}{1+Q_s}\,T}\right)\right] + A_{g0}, & 0 \le t \le \dfrac{Q_o Q_s}{1+Q_s}\,T \\[3ex]
A_{\mathrm{gamp}}\cos\left(\dfrac{\pi}{2}\,\dfrac{t - \frac{Q_o Q_s}{1+Q_s}\,T}{\frac{Q_o}{1+Q_s}\,T}\right) + A_{g0}, & \dfrac{Q_o Q_s}{1+Q_s}\,T < t \le Q_o T \\[3ex]
A_{g0}, & Q_o T < t \le T
\end{cases}
$$

where

$A_{\mathrm{gamp}}$ is the amplitude of the glottal area function,
$A_{g0}$ is the minimum glottal area,
$T$ is the pitch period,
$Q_o$ is the open quotient (ratio of duration of open phase to the pitch period),
$Q_s$ is the speed quotient (ratio of duration of opening phase to duration of
closing phase).

Figure 3-3. The model of the glottal area function of Ananthapadmanabha
and Fant [1982].

The advantages of this model are that (1) not only the modal voice but also the breathy
voice can be simulated, since the minimum glottal area can be greater than zero, and (2) the
specification of the parameters is simplified, because changing only the pitch period
does not affect the shape of the glottal area function. To apply this model, one must
assign the five parameters properly.
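
The modified model can be evaluated sample by sample once the five parameters are chosen. The following is a minimal sketch of such an evaluation, not the implementation used in this research; the function name and the numerical values in the example are illustrative assumptions, and the parameters are assumed to satisfy 0 < Qo <= 1 and Qs > 0.

```python
import math

def glottal_area(t, T, Qo, Qs, Agamp, Ag0=0.0):
    """Glottal area at time t (seconds) for the modified area model.

    T: pitch period, Qo: open quotient, Qs: speed quotient,
    Agamp: amplitude of the area pulse, Ag0: minimum glottal area.
    """
    t = t % T                              # position within the current period
    t_open = Qo * Qs * T / (1.0 + Qs)      # duration of the opening phase
    t_close = Qo * T / (1.0 + Qs)          # duration of the closing phase
    if t <= t_open:                        # raised-cosine opening
        return Agamp * (0.5 - 0.5 * math.cos(math.pi * t / t_open)) + Ag0
    if t <= Qo * T:                        # quarter-cosine closing
        return Agamp * math.cos(0.5 * math.pi * (t - t_open) / t_close) + Ag0
    return Ag0                             # closed (or leaky) phase


# Illustrative values only: one 8 ms period sampled at 10 kHz,
# with a nonzero minimum area to mimic a breathy voice.
T = 0.008
area = [glottal_area(n / 10000.0, T, Qo=0.6, Qs=2.0, Agamp=0.2, Ag0=0.02)
        for n in range(80)]
```

Because the phase boundaries are expressed through Qo and Qs, changing T alone rescales the pulse in time without altering its shape, which is the second advantage noted above.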

The pitch period, which is the most important parameter related to speech
quality, can be estimated from the original speech. Pitch estimation is a basic speech
analysis problem, and a wide variety of time-domain and frequency-domain algorithms for
pitch detection from the speech signal have been developed. Two typical
algorithms are the modified autocorrelation method, which utilizes clipping
(AUTOC) [Dubnowski et al., 1976], and the cepstrum method (CEP) [Schafer and
Rabiner, 1970]. The AUTOC algorithm is based on the autocorrelation function of
the center-clipped speech. In general, the autocorrelation function has a prominent
peak at the pitch period, but spurious peaks resulting from the vocal tract response
may make it difficult to pick that peak. Center clipping removes the spurious peaks
while retaining the pitch periodicity. Since the autocorrelation function is essentially
phase insensitive, this method provides a good estimate of the pitch period in cases
where the waveform is unreliable due to phase distortion. The CEP algorithm is based
on the cepstrum of the speech signal. The cepstrum of a voiced speech segment exhibits
a peak corresponding to the pitch period. However, the CEP algorithm cannot
reliably decide whether a speech segment is voiced or unvoiced, because the
presence of a low-level peak is not a reliable indication that the analyzed speech
segment is unvoiced.
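
The center-clipped autocorrelation idea can be illustrated with a short sketch. It is a simplified illustration rather than the AUTOC algorithm of the cited reference: the fixed clipping fraction, the search range, and the synthetic test frame are all assumptions introduced here.

```python
import math

def center_clip(x, fraction=0.3):
    """Zero samples below the clipping level and shift the rest toward zero."""
    level = fraction * max(abs(v) for v in x)   # assumed fixed-fraction level
    clipped = []
    for v in x:
        if v > level:
            clipped.append(v - level)
        elif v < -level:
            clipped.append(v + level)
        else:
            clipped.append(0.0)
    return clipped

def autoc_pitch_period(x, fs, f_min=50.0, f_max=400.0):
    """Pitch period (seconds) at the largest autocorrelation value of the clipped frame."""
    y = center_clip(x)
    best_lag, best_val = None, float("-inf")
    for lag in range(int(fs / f_max), int(fs / f_min) + 1):
        r = sum(y[n] * y[n + lag] for n in range(len(y) - lag))
        if r > best_val:
            best_lag, best_val = lag, r
    return best_lag / fs

# Synthetic frame: 125 Hz fundamental plus a phase-shifted second harmonic
fs = 10000
frame = [math.sin(2 * math.pi * 125 * n / fs)
         + 0.6 * math.sin(2 * math.pi * 250 * n / fs + 0.5)
         for n in range(400)]
period = autoc_pitch_period(frame, fs)   # expected to be close to 8 ms
```

The estimate is an average over the analysis frame, which is the window dependence discussed in the next paragraph.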

Both methods, AUTOC and CEP, depend on the window size and
yield only the average pitch period of the segment within the window. Since a real
speech waveform varies both in pitch period and in waveshape within a period, the
speech waveform is only quasi-periodic. This difficulty can be resolved if
the “exact” location of the beginning and the end of each period can be defined.
Therefore, in this research the pitch contour is determined on a period-by-period basis.

The electroglottograph (EGG) [Childers and Krishnamurthy, 1985] provides a
means by which the location of the beginning of each pitch period is defined. In a
comparison of the EGG with the glottal area obtained from synchronized ultra-high-speed
laryngeal films, Krishnamurthy [1983] arrived at several conclusions: (1) the
electroglottograph is an excellent indicator of the vibratory period of the vocal folds;
and (2) the maximum in the differentiated EGG (DEGG) is very close to the instant
when the vocal folds begin to open on the superior surface, while the minimum in the
DEGG almost coincides with the closing instant (Figure 3-4).
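
A period-by-period pitch estimate can be sketched by locating the DEGG minima and differencing their positions. The sketch below assumes an EGG polarity in which glottal closure produces a sharp negative excursion of the DEGG, as stated above; the threshold rule and the synthetic ramp signal are hypothetical and serve only to make the example self-contained.

```python
def differentiate(x):
    """First difference of a sampled signal: d[j] = x[j + 1] - x[j]."""
    return [x[n] - x[n - 1] for n in range(1, len(x))]

def closing_instants(egg, threshold_fraction=0.5):
    """Indices of local DEGG minima deeper than a fraction of the deepest minimum."""
    degg = differentiate(egg)
    floor = threshold_fraction * min(degg)      # min(degg) is negative
    return [n for n in range(1, len(degg) - 1)
            if degg[n] < floor
            and degg[n] <= degg[n - 1]
            and degg[n] < degg[n + 1]]

def pitch_periods(egg, fs):
    """Pitch periods (seconds) between successive closing instants."""
    marks = closing_instants(egg)
    return [(b - a) / fs for a, b in zip(marks, marks[1:])]

# Synthetic EGG-like ramps whose period lengths (in samples) vary slightly
egg = [k / p for p in (80, 78, 82, 80) for k in range(p)]
marks = closing_instants(egg)
periods = pitch_periods(egg, fs=10000)
```

Unlike the windowed methods, each entry of `periods` refers to a single glottal cycle, so small cycle-to-cycle variations in pitch are preserved.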

Both speech and DEGG are used in this research to estimate the pitch period,
the open quotient and the amplitude of the glottal area. As a first estimate, the
amplitude of the glottal area, Agamp, is taken to be proportional to the square root of
the wide-band energy of the speech. A typical value of Agamp for stressed vowels is
approximately 0.2 cm² [Ishizaka and Flanagan, 1972].


Subglottal  Pressure 

Several techniques for measuring the subglottal pressure during phonation
have been reported. These techniques include the tracheal puncture, the transglottal
approach, and the measurement of the intraesophageal pressure. Tracheal puncture
is a straightforward, but invasive method. In the transglottal approach [Koike and
Perkins, 1968], measurement is achieved by placing a small pressure transducer in
the subglottal cavity through the glottis. While this method is non-invasive, the
placement of the transducer requires the trained hand of a laryngologist. The
measurement of the intraesophageal pressure [van den Berg, 1956] is an indirect
method in which an air balloon attached to a tube is first swallowed into the esophagus,
and the pressure inside the balloon is recorded through the tube. The subglottal
pressure is then calculated from the value of the intraesophageal pressure. The
relationship between these two values, however, is reported to be affected by several
factors.

Figure 3-4. Waveforms of the EGG, differentiated EGG, and glottal area
function (after Krishnamurthy and Childers [1986]).

The subglottal pressure is estimated from the speech signal in this research
effort because it is the primary factor involved in raising voice intensity. It has been
documented that a doubling of the subglottal pressure results in a rise of the overall
sound pressure level by about 9 dB [Ladefoged and McKinney, 1963; Isshiki, 1964].
With an increase in the mean subglottal pressure, the maximum glottal area
increases, producing a gain of 3 dB in the flow. Furthermore, the pitch period and,
especially, the duration of the closing phase decrease, with an increase in the rate of
change of the glottal area at closure. All these factors added together create
another 3 dB gain [Fant, 1982]. Based on these facts, the subglottal pressure is
considered to be proportional to the energy of speech for vowels. For consonants, the
value of the subglottal pressure should be estimated to be larger than this, since a
pressure drop at the constriction in the vocal tract causes a decrease in the intensities
of consonant sounds.
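
As a rough worked example of what the 9 dB figure implies, the reported trend can be read as an exact power law. This reading is an assumption made here only for illustration; the symbols $p_1$ and $p_2$ (sound pressure amplitudes before and after the subglottal pressure is doubled) are introduced for this example.

$$
20\log_{10}\frac{p_2}{p_1} = 9\ \text{dB}
\quad\Longrightarrow\quad
\frac{p_2}{p_1} = 10^{9/20} \approx 2.82 \approx 2^{1.5}
$$

Under this reading, a doubling of the subglottal pressure raises the sound pressure amplitude by a factor of roughly 2.8, compared with a factor of 2 (about 6 dB) for a strictly linear relation.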

There is no easy way to obtain the control parameters for the articulatory
synthesizer. An example of the control parameters used in this research is shown in
the appendix.


CHAPTER  4 

SYNTHESIZING  NASALS 


While  Chapters  2 and  3 investigated  the  development  of  a new  articulatory 
speech  synthesizer  and  its  control  parameters,  this  chapter  discusses  the  use  of  this 
synthesizer  in  the  production  of  nasal  sounds.  Experiments  whose  objectives  are  to 
gain  insight  into  the  correlations  between  the  quality  of  the  synthetic  nasal  sounds 
and the control parameters of the synthesizer are also presented. This discussion (1)
provides a background on the production of nasal sounds and their associated acoustic
properties; (2) presents the difficulties in synthesizing nasal sounds; (3) provides the
basis  for  selecting  appropriate  parameters  by  examining  the  structure  of  the  nasal 
tract  and  the  movement  of  the  velum;  and  (4)  presents  the  results  of  the  experiments 
in  which  nasal  sounds  were  synthesized. 

Nasal  Production 

There are three sounds in the English language, namely /m/, /n/ and /ng/,
which require resonance in the nasal cavities and are therefore defined as nasal
consonants. Nasal consonants are produced by a lowering of the velum which, in
turn, allows air to flow through the nasal tract with the oral cavity simultaneously
occluded in one of three ways. For /m/, the lips are closed. Thus, the sound from the
vocal folds is resonated not only in the pharyngeal cavity and in the closed oral cavity,
but in the spacious chambers of the nasal cavities as well. The alveolar nasal /n/ and
the velar nasal /ng/ are produced in a manner analogous to the bilabial nasal /m/,
except for differences in the site where oral cavity occlusion actually occurs. For /n/,
the tip of the tongue touches the upper alveolar ridge of the hard palate, with the back
sides of the tongue touching the molars. For /ng/, the tongue dorsum touches the
posterior part of the hard palate or the soft palate, allowing much less of the oral
cavity to resonate as a side branch of the vocal tract. The acoustic result of closing
the oral cavity while keeping the velum low is the occurrence of
antiformants in the spectrum of nasal consonants. The significance of these
antiformants is discussed in greater detail later in the text.

In continuous speech, the vowels preceding or following nasal consonants are
nasalized. This phenomenon is called nasalization. Nasalized vowels are
produced when the mouth is open but the velum is kept at a low position, which
connects a side branch (the nasal tract) to the vocal tract. It is
precisely this side branch which introduces antiformants into the spectrum of the
nasalized vowels.

The  realization  that  there  are  antiformants  in  the  spectra  of  nasal  sounds  is 
important,  since  listeners  use  acoustic  information  during  the  perception  of  a spoken 
message;  thus,  understanding  the  acoustic  properties  of  speech  is  critical  for  speech 
synthesis. 


Acoustic  Properties  of  Nasals 

While  speech  researchers  have  long  sought  to  gain  an  understanding  of  the 
main  spectral  properties  of  nasal  sounds,  the  most  significant  contributions  are 
largely  attributable  to  the  research  of  House  and  Stevens  [1956],  Hattori  et  al. 
[1958],  Fujimura  [1960,  1961,  1962],  Fujimura  and  Lindqvist  [1971],  and  Hawkins 
and  Stevens  [1985].  Theoretical  analysis  of  the  ideal  model,  sweep-frequency 
measurement  techniques,  speech  synthesis,  and  analysis  of  human  speech  are  among 
the  approaches  selected  for  their  respective  research  efforts.  Fujimura  [1962] 
applied graphical techniques to make theoretical predictions regarding the
distribution of formants and antiformants of nasals. His model for calculating the
transfer  function  contains  three  lossless  cavities,  that  is,  pharyngeal,  oral,  and  nasal 
cavities (Figure 4-1). For all nasal consonants, Fujimura’s model predicts: (1) a
first formant in the range 200 to 300 Hz; (2) a second formant in the range 800 to 1500
Hz; and (3) a third formant in the range 2000 to 3000 Hz. The lower ends of the 200
to 300 Hz and 800 to 1500 Hz ranges are better suited to the case of /m/, while the
higher ends are more applicable to the case of /ng/. The first zero, which appears in
conjunction with an additional formant, may occur in the frequency range 800 to
1500 Hz for /m/, 2000 to 3000 Hz for /n/, and above 3000 Hz for /ng/. Therefore, the
/m/, /n/, and /ng/ nasals are characterized by the low, medium, and high positions of
their first antiformant, respectively. The antiformant also changes its
position  appreciably  from  word  to  word  depending  on  the  change  in  the  configuration 
of  the  oral  cavity.  While  other  formants  remain  relatively  constant,  the  antiformant 
appears  to  have  considerable  influence  on  the  formants  in  its  immediate  vicinity. 
Since  the  lossless  model  cannot  predict  the  bandwidths  of  formants  and 
antiformants,  Fujimura  [1962]  broadened  the  scope  of  his  research  by  investigating 
the  human  sound  spectra  of  nasal  consonants  in  various  vowel  contexts.  The 
bandwidths  of  formants  and  antiformants  are  shown  in  Table  4-1.  The  results 
showed that the damping of the resonances and the antiresonances is another important
feature of nasal consonants. On the average, bandwidths of nasal consonant
formants are comparable to or greater than those of vowels. Since this method
cannot eliminate the effects of the voice source, the bandwidth values are not very
accurate.
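
One common way to see why the first antiformant rises from /m/ to /n/ to /ng/ is to idealize the occluded oral cavity as a uniform, lossless tube closed at one end. Such a side branch short-circuits the main path at its quarter-wavelength frequency, f = c / (4L). The sketch below uses this idealization only; it is not Fujimura's graphical method, and the sound speed and the tube lengths are illustrative assumptions rather than values taken from this chapter.

```python
SPEED_OF_SOUND_CM_S = 35000.0   # assumed value for warm, moist air

def side_branch_zero_hz(length_cm, c=SPEED_OF_SOUND_CM_S):
    """Lowest antiresonance of a uniform closed side branch: f = c / (4 L)."""
    return c / (4.0 * length_cm)

# Hypothetical oral side-branch lengths (cm), longest for the bilabial closure
for nasal, length_cm in [("/m/", 7.0), ("/n/", 3.5), ("/ng/", 2.0)]:
    print(nasal, round(side_branch_zero_hz(length_cm)), "Hz")
```

With these assumed lengths the zeros fall at 1250, 2500, and 4375 Hz, which lie inside the three ranges quoted above; a shorter side branch always moves the zero upward.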

Figure 4-1. The model for calculating the transfer function of nasal
consonants.

Table 4-1. Values of half-power bandwidths of formants and antiformants for nasal
consonants. Averages are taken for spectral samples throughout the nasal murmur
in five vowel contexts (after Fujimura [1962]).

                 /m/     /n/     /ng/
Formant 1         60      40      80
Formant 2         60     100     100
Formant 3         90     110     230
Formant 4        280     170     100
Formant 5        170     100     ...
Antiformant       80     600     ...

Hattori et al. [1958] employed a Sonagraph to investigate the principal
features characteristic of the nasalization of vowels. They examined the changes in
the sound spectrograms of five Japanese vowels, which were first pronounced orally
and then suddenly nasalized while keeping the articulation constant. The
conclusions resulting from this investigation are as follows: (1) a broad and flat
formant appears at around 250 Hz; (2) an antiformant occurs at about 500 Hz; and
(3) additional weak and diffuse components fill the “valley” between the formants of
the vowels, particularly in the frequency region from 1000 to 2500 Hz. The actual
frequency regions vary from vowel to vowel and also from person to person.

Fujimura  [1960]  studied  the  spectra  of  nasalized  vowels  theoretically  by  using 
an  ideal  lossless  model  (Figure  4-2),  and  concluded: 

(1)  Nasalized  vowels  have  two  types  of  formants,  the  nasal  formant  and  the 
shifted-oral  formant.  Each  nasal  formant  is  paired  with  an  antiformant.  As  the  size 
of  the  velopharyngeal  port  decreases,  the  nasal  formant  and  antiformant  approach 
each  other.  If  the  size  of  the  velopharyngeal  port  is  zero,  an  annihilation  results.  The 
shifted-oral  formant  always  corresponds  to  a formant  of  non-nasalized  vowel 
without  discontinuity. 

(2)  The  first  formant  of  nasalized  vowels  can  be  either  a nasal  formant  or  a 
shifted-oral  formant  — depending  on  the  location  of  the  original  first  formant  for  the 
non-nasalized  configuration.  If  the  frequency  of  the  original  first  formant  is  higher 
than  the  lowest  resonant  frequency  of  the  nasal  tract  when  the  velum  is  closed,  the 
first  formant  is  a nasal  formant. 

(3)  The  shifted-oral  formants  are  always  higher  than  the  formants  of  a 
non-nasalized  vowel  with  the  same  vocal-tract  configuration. 

(4)  The  frequency  of  a nasal  formant  is  always  higher  than  the  frequency  at 
which  this  formant  would  be  annihilated,  when  the  vowel  is  denasalized. 

(5) All formants shift monotonically upwards as the degree of coupling
increases.

Figure 4-2. The model for calculating the transfer function of nasalized vowels.

Difficulties in Synthesizing Nasals

As previously mentioned, the spectra of nasal consonants and nasalized
vowels are quite complex, containing both formants and antiformants, which makes
nasal sounds difficult to synthesize with any terminal-analog synthesizer.
The LPC synthesizer is an all-pole model
and, as such, cannot produce antiformants; thus, it is impossible to synthesize
high-quality nasals using LPC synthesizers.
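
The difference between an all-pole section and a section containing a zero pair can be made concrete with the familiar second-order digital resonator used in formant synthesizers. The sketch below is illustrative and is not part of the articulatory synthesizer; the function names are hypothetical, and the example frequencies and bandwidths are borrowed loosely from values quoted in this chapter.

```python
import cmath
import math

def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients (A, B, C) of a second-order resonator with unity gain at 0 Hz."""
    c = -math.exp(-2.0 * math.pi * bw_hz / fs)
    b = 2.0 * math.exp(-math.pi * bw_hz / fs) * math.cos(2.0 * math.pi * freq_hz / fs)
    a = 1.0 - b - c
    return a, b, c

def resonator_gain(f, freq_hz, bw_hz, fs):
    """|H(f)| of the all-pole section H(z) = A / (1 - B z^-1 - C z^-2)."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    z1 = cmath.exp(-2j * math.pi * f / fs)       # z^-1 on the unit circle
    return abs(a / (1.0 - b * z1 - c * z1 * z1))

def antiresonator_gain(f, freq_hz, bw_hz, fs):
    """|H(f)| of the inverse (zero-pair) section, which produces a spectral dip."""
    return 1.0 / resonator_gain(f, freq_hz, bw_hz, fs)

# One formant near 250 Hz cascaded with one antiformant near 1800 Hz
fs = 10000.0
for f in (250.0, 900.0, 1800.0, 2400.0):
    cascade = resonator_gain(f, 250.0, 60.0, fs) * antiresonator_gain(f, 1800.0, 80.0, fs)
    print(f, round(20.0 * math.log10(cascade), 1), "dB")
```

An all-pole model has only sections of the first kind, so a dip such as the one at 1800 Hz can at best be approximated by the valley between neighboring poles.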

Although the parallel-configured formant synthesizer can produce
antiformants between formant peaks, both the frequency and the bandwidth of these
antiformants are dependent upon the formants. Hence, formant synthesizers cannot
reproduce the spectra of natural nasals exactly.

The  nasals  are  produced  primarily  by  a lowering  of  the  velum,  which  can  be 
easily  simulated  by  the  articulatory  speech  synthesizer;  and  so,  the  articulatory 
synthesizer  has  the  capability  to  produce  high-quality  nasal  sounds.  In  order  to 
correctly  model  the  nasal  tract,  the  structure  of  the  nasal  tract  and  the  movement  of 
the  velum  must  be  taken  into  consideration. 

Nasal  Tract  and  Movement  of  Velum 

Of the entire vocal tract, the nasal tract is less accessible for measurement
than the oral and pharyngeal tracts. The nasal tract begins at the velum and
ends at the nostrils. The overall length of the nasal pathways, measured from the uvula
to the outlet at the nostrils, is about 12.5 cm according to X-ray photography
[Fant, 1960]. For an adult male, the nasal cavities are coupled to the vocal tract at a
point approximately 8 cm from the glottis. There are at least three or four pairs of
sinuses in the nose [Lindqvist and Sundberg, 1972]. The two maxillary sinuses are
symmetrically situated in the bone on the right and left sides of the nasal tract. The
two frontal sinuses are situated above the nasal tract in the bone of the forehead.
These cavities are acoustically coupled to the nasal tract via short channels within the
bone structure. The maxillary sinus pair is the largest in terms of volume, about 20.8
cm³, and has a fairly large opening to the nasal tract, about 0.1 cm² [Maeda, 1982b].
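
A cavity coupled to a tract through a short channel is often idealized as a Helmholtz resonator. As a rough sketch of the frequency region in which the maxillary sinuses might act, the estimate below applies the standard Helmholtz formula to the volume and opening quoted above. The channel length of 0.5 cm, the sound speed, the neglect of end corrections, and the treatment of the sinus pair as a single resonator are all assumptions made here for illustration.

```python
import math

def helmholtz_hz(volume_cm3, neck_area_cm2, neck_length_cm, c=35000.0):
    """Helmholtz resonance f = (c / 2 pi) * sqrt(A / (V * L)), in CGS units."""
    return (c / (2.0 * math.pi)) * math.sqrt(
        neck_area_cm2 / (volume_cm3 * neck_length_cm))

# Volume and opening area from the text; the 0.5 cm channel length is hypothetical.
f_sinus = helmholtz_hz(volume_cm3=20.8, neck_area_cm2=0.1, neck_length_cm=0.5)
```

Under these assumptions the resonance falls in the neighborhood of 550 Hz; a longer or narrower channel would lower it.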

During  continuous  speech,  the  entrance  to  the  chambers  of  the  nose  must  be 
closed  off  most  of  the  time  for  the  oral  sounds;  however,  it  must  be  open  for  the  three 
nasal  sounds.  The  entrance  to  the  large  nasal  chambers  from  the  pharyngeal  and 
oral  cavities  is  called  the  velopharyngeal  port.  It  can  be  closed  by  elevating  and 
backing  the  velum  until  it  approximates  the  posterior  pharyngeal  wall.  Talkers 
elevate  the  velum  posteriorly  to  achieve  the  tightest  seal  for  consonants,  especially  in 
the  case  of  fricatives  such  as  /s/,  because  these  sounds  require  large  intraoral 
pressure.  Any  leakage  into  the  nasal  cavities  would  decrease  the  required  pressure. 
The  degree  of  closure  of  the  velopharyngeal  port  varies  according  to  phonetic 
context — from  the  low  position  characteristic  of  nasals,  the  intermediate  positions 
characteristic  of  low  vowels,  the  more  nearly  closed  positions  characteristic  of  high 
vowels,  to  the  highest  positions  typical  of  consonants.  Bjork’s  [1961]  tomographic 
and  cineradiographic  study  concluded  that  the  velopharyngeal  port  area  is  a linear 
function  of  the  port’s  sagittal  minor  axis  and  that  the  constant  of  the  proportionality 
is  10  mm.  Hence,  the  anatomical  structure  of  the  port  is  more  nearly  rectangular. 
The  observations  of  several  investigators  (Bjork  [1961],  Warren  [1967],  and  Isshiki  et 
al.  [1968])  converge  toward  the  opinion  that  the  linguistically  useful  region  of  the 
velopharyngeal  control  lies  within  the  range  from  zero  to  slightly  more  than  1 cm2.  A 
general rule is that when the velum comes within 2 mm of the pharynx (producing an
open area of about 20 mm²) there is no apparent nasality; in contrast, a wider
opening (5 mm, or 50 mm² in area) produces nasal resonances and the resulting
speech is definitely perceived as nasal [Borden and Harris, 1984].

A greater damping of resonance in the nasal part than in the oral part of the
vocal tract can be expected due to the greater ratio of surface outline to
cross-sectional area. Nostril hair also contributes to the damping.
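
The port-area rule quoted earlier in this section (area proportional to the sagittal minor axis, with a constant of 10 mm) and the two perceptual thresholds can be combined in a small sketch. The function names are hypothetical, and the label for openings between the two thresholds is an assumption, since the rule quoted above does not cover that region.

```python
PORT_WIDTH_MM = 10.0   # proportionality constant reported by Bjork [1961]

def port_area_mm2(minor_axis_mm):
    """Velopharyngeal port area as a linear function of the sagittal minor axis."""
    return PORT_WIDTH_MM * minor_axis_mm

def nasality_expectation(minor_axis_mm):
    """Coarse perceptual expectation based on the general rule in the text."""
    area = port_area_mm2(minor_axis_mm)
    if area <= 20.0:
        return "no apparent nasality"
    if area >= 50.0:
        return "perceived as nasal"
    return "not covered by the rule"   # assumed label for the intermediate region

for gap_mm in (1.0, 2.0, 3.5, 5.0, 10.0):
    print(gap_mm, port_area_mm2(gap_mm) / 100.0, "cm2", nasality_expectation(gap_mm))
```

A gap of 10 mm corresponds to 1 cm², the upper end of the linguistically useful range mentioned above.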


Simulation  Results 

In  order  to  investigate  the  relationship  between  the  quality  of  synthetic  nasal 
sounds  and  the  control  parameters,  a sentence  containing  nasals  was  synthesized. 
Although  a human  speech  sample  was  used  to  derive  some  control  parameters,  no 
attempt  was  made  to  match,  exactly,  the  synthetic  spectrogram  with  its  natural 
counterpart.  However,  the  natural  speech  spectrogram  is  quite  useful  in  obtaining  a 
good  estimate  of  the  required  duration  of  each  segment  of  the  synthetic  sentence.  In 
addition,  the  pitch  and  the  intensity  contours  provide  the  phonation  information. 

In the following two experiments, speech and EGG data were collected from an
experienced male speech pathologist uttering the sentence “Ben went
mining.” The acoustic speech waveform and the EGG
were simultaneously digitized and stored on a computer disk for later analysis. The
synchronized speech and EGG signals were recorded with the talker situated
inside an Industrial Acoustics Company (IAC) single-wall sound room. The
microphone was placed at a fixed distance (6 inches) from the talker’s lips.
Digitization was accomplished with the Digital Sound Corporation (DSC) model
200 stereo A/D and D/A system, which has 16-bit accuracy. The signals were
digitized at a sampling frequency of 10 kHz, with a 5 kHz anti-aliasing filter
applied before digitization.

The  four  features  extracted  from  the  speech  and  the  EGG  signals  were  (1) 
pitch  contour;  (2)  intensity  contour;  (3)  glottal  open  quotient;  and  (4)  formant 
frequencies as functions of time. The articulatory information was obtained
heuristically from phonetic considerations and from X-ray data available in the
literature [Fant, 1960; Levinson and Schmidt, 1983; Hecker, 1962], while the
phonation information was derived from the pitch contour, intensity contour and the
glottal open quotient.

Based on the features extracted from the speech sample and the area
functions from the literature [Fant, 1960], a trial-and-error synthesis process was
carried out (see appendix for details). After obtaining satisfactory synthetic speech, listening
tests  were  conducted  to  investigate  the  effects  of  the  maxillary  sinuses  and  the 
opening  of  the  velopharyngeal  port  on  the  quality  of  the  nasal  sounds.  The  judges  for 
the  listening  tests  in  this  research  were  two  professors  from  the  Speech  Department, 
both  of  whom  are  professional  voice  diagnosticians;  and  a professor,  who  is  an 
experienced  speech  scientist,  from  the  Electrical  Engineering  Department.  Two 
separate  experiments  were  conducted. 


Experiment  I 

This  experiment,  which  involves  speech  synthesis  and  perceptual  evaluation, 
was  conducted  in  order  to  study  the  effect  of  the  sinuses  on  the  quality  of  nasal 
sounds.  Figure  4-3  shows  the  area  functions  of  the  nasal  tract  and  the  maxillary 
sinuses  used  in  the  experiment. 

The  paired-comparison  listening  test  was  chosen  for  this  experiment.  The 
test  stimuli  were  synthetic  voices  of  “Ben  went  mining.”  They  were  synthesized  by 
using  the  articulatory  synthesizer  with  and  without  the  maxillary  sinuses  and  at  five 
different  levels  of  velopharyngeal  openings  (0.1,  0.2,  0.4,  0.6,  and  1.4  cm2).  During 
the  listening  test,  the  judges  were  asked  to  decide  which  voice  in  the  pair  was  more 
natural  with  respect  to  nasality.  The  synthetic  speech  samples  were  presented 
through  a tape  recorder  via  headphones  in  a professional  sound  room.  The  order  of 
presentation of the pairs, as well as the position within pairs (i.e., the first versus
second stimulus of the pair), were randomized. Preceded by a tone, each pair of
samples was repeated twice. Between samples, there was a 4-second silence during
which each judge was expected to make a choice.

Figure 4-3. The area function of the nasal tract.
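
The randomization described for the paired-comparison test can be sketched as follows. This is an illustrative reconstruction, not the procedure actually used to prepare the test tape; the function name and the seed are hypothetical, and "S" and "NS" follow the labels of Table 4-2 (with and without the maxillary sinuses).

```python
import random

def build_trials(openings_cm2, seed=None):
    """Return randomized (first, second) stimulus pairs, one pair per opening."""
    rng = random.Random(seed)
    trials = []
    for opening in openings_cm2:
        pair = [("S", opening), ("NS", opening)]
        rng.shuffle(pair)          # randomize the position within the pair
        trials.append(tuple(pair))
    rng.shuffle(trials)            # randomize the order of presentation
    return trials

trials = build_trials([0.1, 0.2, 0.4, 0.6, 1.4], seed=1)
for first, second in trials:
    print(first, "then", second)
```

Fixing the seed makes a particular presentation order reproducible, which is convenient when the same tape must be regenerated.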

Table 4-2 lists the perceptual ratings. The results show that for the
velopharyngeal openings of 0.2 and 1.4 cm², most of the judges preferred the nasal
sounds synthesized with the maxillary sinuses. These results, unfortunately, are not
reliable, since all of the judges thought the differences within each pair were subtle
and, in fact, one judge reported that he could hardly tell the two samples in a
pair apart. Nevertheless, it is highly plausible that the maxillary sinuses affect the quality
of the nasal sounds only to a very small extent. Figure 4-4 shows the waveforms and
the spectra of nasal consonants synthesized with and without the sinuses. The
antiformant near 1800 Hz was deeper in the spectrum of the nasal synthesized
with the maxillary sinuses. However, the waveforms of nasalized
vowels are almost the same (Figure 4-5). This result is in direct conflict with
Maeda’s  [1982b]  result.  He  claimed  that  when  an  acoustic  tube  having  the 
appropriate  area  function  but  no  side  branching  cavities  was  used  as  the  nasal  tract, 
the  low  frequency  nasal  formant  below  the  first  formant  of  the  low  vowels  (as 
observed  in  natural  speech)  could  not  be  simulated.  He  also  claimed  that  although 
the  vowel  quality  was  modified,  the  synthetic  nasalized  mid  and  low  vowels  did  not 
sound  nasalized. 

There are two reasons why the maxillary sinuses affect the
nasalized vowel to such a small extent. First, the input impedance of the nasal tract
at the velopharyngeal port is changed very little by adding the sinuses, since the nasal
tract is heavily damped and the maxillary sinus pair is near the nostrils.
Second, the volume velocity at the nostrils is much smaller than the volume velocity
at the lips during the nasalized vowels, while the sound pressure is proportional to the
time derivative of the sum of these two volume velocities. Although the volume velocity
at the nostrils is changed, the sound of nasalized vowels is changed only slightly, since the
volume velocity at the nostrils is in itself a small part of the sum.

Table 4-2. The size of velopharyngeal opening and the preference for nasal
sounds synthesized with or without the maxillary sinuses.

Sample    Size of Velopharyngeal            Preference
No.       Opening                     J1     J2     J3     Avg.
1         0.1 sq. cm                  NS     S      NS     NS
2         0.2 sq. cm                  S      S      NS     S
3         0.4 sq. cm                  NS     NS     NS     NS
4         0.6 sq. cm                  S      NS     NS     NS
5         1.4 sq. cm                  S      S      S      S

* Preference: NS means prefer the speech sample synthesized without the maxillary
sinuses. S means prefer the speech sample synthesized with the maxillary sinuses.
* Ji represents judge i, i = 1, 2, 3.
* The sample sentence is “Ben went mining.”

Figure 4-4. The speech waveforms of nasal sounds /m/ synthesized with
(bottom) and without (upper) maxillary sinuses.

Figure 4-4. (continued) The corresponding spectra.

Figure 4-5. The waveforms of nasalized vowel synthesized with (dark line)
and without (light line) the maxillary sinuses.


Experiment  II 

The  purpose  of  this  experiment  was  to  correlate  the  nasality  of  the  synthetic 
voices  to  the  opening  of  the  velopharyngeal  port.  The  stimuli  of  this  experiment  were 
the  same  as  in  Experiment  I,  with  the  exception  that  voices  produced  with  the 
maxillary  sinuses  were  used. 

Nasality is a characteristic parameter of a talker's voice. Failure to make
perceptually acceptable adjustments of the velopharyngeal mechanism can be
divided into two disorders: hypernasality, which is characterized by too much nasal
resonance, and hyponasality, which is characterized by too little nasal resonance on
/m/, /n/ and /ng/ [Borden and Harris, 1984]. During the listening test, the judges
were asked to rate the nasality of the voices on a 1 to 7 scale, where a rating of 7
represents a high degree of nasality and a rating of 1 represents the absence of
nasality. The order of presentation of the samples was randomized, with each sample
repeated twice and preceded by a tone.

The results of the perceptual rating, listed in Table 4-3, showed that the
perception of nasality was related to the opening of the velopharyngeal port. Judges
2 and 3 consistently reported that nasality increased as the opening of the
velopharyngeal port increased; judge 1 did not. Figure 4-6 illustrates the
waveforms and the spectra of nasal consonants and nasalized vowels with different
openings of the velopharyngeal port. The spectra of the nasal consonant /m/ showed
that: (1) the first zero, which appeared in conjunction with an additional formant,
occurred at about 1.8 kHz, and (2) the first, second, and third formants were at about
250, 900, and 2400 Hz, respectively. These values were in agreement with Fujimura's
predictions [Fujimura, 1961]. With the increase of the opening of the velopharyngeal
Table 4-3.  The size of velopharyngeal opening and nasality ratings for
testing sentence samples ("Ben went mining.").

Sample   Size of Velopharyngeal   Nasality Ratings
No.      Opening                  J1    J2    J3    Avg.

1        0.1 sq. cm               6     3     2     3.7
2        0.2 sq. cm               2     4     2     2.7
3        0.4 sq. cm               3     4     3     3.3
4        0.6 sq. cm               4     5     5     4.7
5        1.4 sq. cm               3     5     6     4.7

* The nasality rating scale is from 1 to 7. 7 represents a high degree of
  nasality and 1 represents the absence of nasality.
* Ji represents judge i, i = 1, 2, 3.
Figure 4-6.  The speech waveforms and spectra of the nasal sounds.
(a) Speech waveforms of nasal consonant /m/.
(b) Spectra of nasal consonant /m/.
(c) Speech waveforms of nasalized vowel /a/.
(d) Spectra of nasalized vowel /a/.
In each panel:
upper:  velopharyngeal opening = 0.1 sq. cm,
middle: velopharyngeal opening = 0.4 sq. cm,
bottom: velopharyngeal opening = 1.4 sq. cm.
port, the valleys in the spectra became progressively deeper. The results also showed
that the amount of velopharyngeal opening for nasals as determined by the
experimental results was in agreement with the data found in the literature (Bjork
[1961], Warren [1967], and Isshiki et al. [1968]). This indicated that the proposed
articulatory speech synthesizer effectively simulated the nasal tract. These results
suggested that, for synthesizing nasal sounds, the velopharyngeal opening should
be assigned a value of about 0.4 to 0.6 cm².
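
The relationship between opening size and perceived nasality can be checked directly from the ratings in Table 4-3. The following is a minimal sketch, not part of the original analysis; the helper names are chosen here for illustration. It tests whether each judge's ratings never decrease as the opening grows, and recomputes the row averages.

```python
# Nasality ratings transcribed from Table 4-3, ordered by opening size.
openings_sq_cm = [0.1, 0.2, 0.4, 0.6, 1.4]
ratings = {
    "J1": [6, 2, 3, 4, 3],
    "J2": [3, 4, 4, 5, 5],
    "J3": [2, 2, 3, 5, 6],
}


def is_non_decreasing(values):
    """Return True when each value is at least as large as the previous one."""
    return all(a <= b for a, b in zip(values, values[1:]))


def row_averages(table):
    """Average the judges' ratings sample by sample, rounded to one decimal."""
    per_sample = zip(*table.values())
    return [round(sum(sample) / len(sample), 1) for sample in per_sample]


# Which judges rated nasality as rising (or holding) with the opening size?
monotonic = {judge: is_non_decreasing(values) for judge, values in ratings.items()}
averages = row_averages(ratings)

for judge, flag in monotonic.items():
    print(judge, "non-decreasing with opening size:", flag)
print("row averages:", dict(zip(openings_sq_cm, averages)))
```

With only three judges and five samples, such a check is descriptive rather than a statistical test.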

There are two limitations in this experiment. First, the target cross-sectional
area functions of nasal consonants used in this experiment (from Fant [1960]) were
not accurate, since they are dependent upon the context in which the nasal consonant
is produced, whereas Fant provided only two different area functions for each nasal
consonant. Second, the scheme used to simulate activity of the vocal tract during
those time intervals usually referred to as articulatory transitions was oversimplified.
Since the dynamics of the human articulatory apparatus are not fully understood, the
scheme employed in this research was based on linear interpolation. For
example, in the production of a CV syllable, two suitable vocal-tract area functions
were selected. One area function approximated the vocal tract configuration
during an idealized, static consonantal articulation, while the other area function
approximated the configuration of the vocal tract during an idealized, static vocalic
articulation. A linear interpolation of these two area functions was used to simulate
the articulatory transition. In real situations, some parts of the human articulatory
apparatus, particularly the lips and the tip of the tongue, can move very rapidly, while
other parts, such as the back of the tongue and the lower pharynx, are more restricted
in their motion [Borden and Harris, 1984].
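
The linear-interpolation scheme described above can be sketched as follows. This is an illustrative sketch only: the function name, the number of frames, and the four-section area values are hypothetical and are not taken from the synthesizer or from Fant's data.

```python
def interpolate_area_function(area_start, area_end, num_frames):
    """Linearly interpolate between two vocal-tract area functions.

    area_start, area_end: cross-sectional areas (sq. cm), one per tract section.
    num_frames: number of frames in the transition, including both endpoints.
    Returns a list of area functions, one per frame.
    """
    if len(area_start) != len(area_end):
        raise ValueError("both area functions need the same number of sections")
    if num_frames < 2:
        raise ValueError("a transition needs at least two frames")
    frames = []
    for k in range(num_frames):
        weight = k / (num_frames - 1)  # 0 at the consonant, 1 at the vowel
        frames.append(
            [(1.0 - weight) * a + weight * b for a, b in zip(area_start, area_end)]
        )
    return frames


# Hypothetical four-section targets for a CV transition (values are made up).
consonant_target = [2.0, 1.5, 0.0, 3.0]
vowel_target = [1.0, 2.5, 4.0, 5.0]
frames = interpolate_area_function(consonant_target, vowel_target, 5)
```

Every section moves at the same relative rate in this scheme, which is exactly the simplification noted above: it cannot represent fast lip or tongue-tip movements alongside slower pharyngeal movements.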


CHAPTER  5 

SYNTHESIZING  FEMALE  VOICE 


Text-to-speech  systems  are  widely  used  for  information  services  and 
handicap  aids.  Although  the  intelligibility  of  the  synthetic  speech  produced  by  these 
systems  is  of  paramount  importance,  quality  and  naturalness  have  a great  effect  on 
the  usefulness  and  the  acceptability  of  text-to-speech  systems.  In  some  applications 
text-to-speech  systems  are  required  to  produce  several  different  voices,  a male  voice 
and a female voice for example. It is well established that in synthetic speech, the
female voice has not been reproduced with the same level of success as the male
voice [Monsen and Engebretson, 1977; Klatt, 1987]. This is because: (1) almost all
existing  speech  synthesizers  are  scaled  after  a male  prototype;  and  (2)  the 
anatomical  differences  of  the  vocal  tract  and  vocal  folds  between  males  and  females 
have  not  been  investigated  extensively  [Fant,  1980]. 

This  chapter  discusses  how  to  synthesize  female  voices  using  the  articulatory 
synthesizer.  The  acoustic  properties  of  the  female  voice  will  be  discussed  first, 
followed  by  a review  of  the  difficulties  involved  in  the  synthesis  of  the  female  voice 
using  LPC  and  formant  synthesizers.  Then,  an  overview  of  the  physiological 
differences  of  the  vocal  folds  and  the  vocal  tract  between  females  and  males  is 
provided.  Next,  the  noise  generator  at  the  glottis,  which  is  important  for  synthesizing 
breathy  voices,  is  discussed  in  detail.  The  chapter  concludes  with  a discussion  of  the 
voice  conversion  method  which  is  used  to  identify  the  major  control  parameters  of 
the  articulatory  synthesizer  for  producing  female  voices. 
Properties of Female Voice

The acoustic properties of the female voice have been studied for many years.
Research demonstrates that adult males and adult females have markedly different
fundamental frequencies. For example, Hollien and Jackson [1973]
report a mean fundamental frequency of 123 Hz for young adult males, while Linke
[1973] reports that young adult females have a mean fundamental frequency of 200
Hz. A comparison of the fundamental frequency for young adults reveals that the
difference is approximately 8 semitones or 2/3 octave. Although this difference
fluctuates at various ages, the general magnitude is maintained throughout
adulthood. Based on this determination, Coleman [1976] and Lass et al. [1976]
suggest that fundamental frequency is the prime determining factor in the perception
of sex from speech.
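
The interval between the two mean fundamental frequencies quoted above can be verified with the standard equal-tempered relation, in which an octave is a factor of 2 and contains 12 semitones. The function name below is chosen for illustration.

```python
import math


def interval_in_semitones(f_low_hz, f_high_hz):
    """Musical interval between two frequencies, in equal-tempered semitones."""
    return 12.0 * math.log2(f_high_hz / f_low_hz)


# Mean fundamental frequencies quoted in the text (Hz).
male_f0 = 123.0
female_f0 = 200.0

semitones = interval_in_semitones(male_f0, female_f0)
octaves = semitones / 12.0
print(f"{semitones:.2f} semitones, {octaves:.2f} octave")
```

The result is a little over 8 semitones, consistent with the rounded figure of about 2/3 octave.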

Along  with  the  distinctly  different  fundamental  frequencies  typically 
associated  with  adult  males  and  females,  formant  patterns,  a second  prominent 
feature,  were  studied  for  their  contribution  to  sex  identification  from  speech 
[Coleman,  1976;  Lass  et  al.,  1976].  The  pattern  of  formant  frequencies  for  each 
articulated  vowel  is  at  higher  values  for  females  than  for  males  [Peterson  and 
Barney,  1952].  Formant  frequencies  for  the  female  voice  are  typically  scaled  up 
from  those  of  the  male  voice  by  a factor  of  1.2  or  more.  Investigators  who  have 
studied  the  vowel  productions  of  adults  have  noticed  that  sexual  distinctions  are  both 
vowel-dependent  and  formant-dependent.  Fant  [1973;  1975]  used  formant  scale 
factors  (K-factors)  to  describe  the  percentage  relationship  of  male  and  female 
formant frequencies. Using the formula $K_n\% = 100\,(F_{nf}/F_{nm} - 1)$, where $F_{nf}$ and $F_{nm}$
are the nth formant of the female and the male respectively, he determined that
male/female differences were largest for F2 and F3 of the front vowels, while the
differences were smaller for F1 and F2 of the back vowels and F1 of the close front
vowels. Fant [1975] later demonstrated the consistency of this phenomenon by
showing that the pattern of sexual differences across the various vowels was similar
for speakers of several different languages.
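
The K-factor formula above is simple to apply. The sketch below uses made-up formant values purely to show the arithmetic; they are hypothetical and are not Fant's or Peterson and Barney's measurements.

```python
def k_factor_percent(f_female_hz, f_male_hz):
    """Formant scale factor K_n in percent: 100 * (F_nf / F_nm - 1)."""
    return 100.0 * (f_female_hz / f_male_hz - 1.0)


# Hypothetical formant frequencies (Hz) for one vowel, for illustration only.
male_formants = {"F1": 300.0, "F2": 2000.0, "F3": 3000.0}
female_formants = {"F1": 330.0, "F2": 2400.0, "F3": 3450.0}

k_factors = {
    name: k_factor_percent(female_formants[name], male_formants[name])
    for name in male_formants
}
for name, value in k_factors.items():
    print(f"{name}: K = {value:.1f}%")
```

Unequal K values across formants of the same vowel, as in this made-up case, are what is meant by the scaling being formant-dependent.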

Klatt  [1987]  reported  that  the  strength  of  the  fundamental  component  is 
greater  and  the  general  tilt  of  the  harmonic  spectrum  is  steeper  in  the  spectrum  of  the 
female  voice  than  that  in  the  spectrum  of  the  male  voice.  These  features  are  related 
to  the  glottal  excitation  waveforms.  Monsen  and  Engebretson  [1977]  examined  the 
glottal  waveforms  of  two  males  and  two  females.  They  found  a difference  in  the 
symmetry  of  the  waveforms.  The  waveshape  produced  by  male  subjects  is  typically 
asymmetrical  and  frequently  shows  a prominent  hump  in  the  opening  phase  of  the 
wave.  The  closing  portion  of  the  wave  generally  occupies  20%  - 40%  of  the  total 
period  and  there  may  or  may  not  be  an  easily  identifiable  closed  period.  The  female 
waveform  tends  more  toward  symmetry.  There  is  seldom  a hump  during  the  opening 
phase  and  both  the  opening  and  closing  parts  of  the  wave  occupy  more  nearly  equal 
proportions  of  the  period.  Carrell  [1981]  examined  the  perceptual  importance  of  this 
difference  and  found  that  the  glottal  source  plays  a much  more  important  role  in  the 
perception  of  the  talker’s  sex  than  the  formant  patterns. 

Many  researchers  believe  that  breathiness  is  a typical  feature  of  female  voice 
[Koike  and  Hirano,  1973;  Kitzing  and  Sonesson,  1974;  Hildebrand,  1976;  Klatt, 
1987].  Breathiness  is  a quality  which  is  quite  often  heard  as  a modification  of  the 
modal  voice  in  female  voices.  Based  on  a detailed  spectral  analysis  of  a female 
talker's voice, Klatt [1986] discovered that considerable random breathiness noise
exists at frequencies above 2 kHz over portions of many utterances in the female
voice.


Difficulties  in  the  Synthesis  of  the  Female  Voice 
LPC is a good analysis/synthesis system. Multi-pulse LPC synthesizers
[Atal and Remde, 1982] can produce synthetic speech that is perceptually nearly
indistinguishable from the original, but anomalies arise when using this approach to
create new voice types. Since there are no rules for assigning the LPC coefficients to
produce the desired voices, just changing the fundamental frequency (as would be
required in a text-to-speech device) will degrade the quality of the synthetic speech.
The reason for this degradation is that the predictor equations, in the autocorrelation
form, do not estimate the formant frequencies or the bandwidths accurately. This is
not a problem provided that one uses the same fundamental frequency during
resynthesis, because the error is undone; but if a new fundamental frequency is
employed, the first formant may be in error by 8% or more [Atal and Schroeder,
1975; Klatt, 1986] and formant bandwidths can be seriously deviant. Additional
losses of naturalness occur if smoothing at the segment boundaries results in too
rapid a change in synthesis parameters [Klatt, 1987].

Formant synthesizers can successfully produce the male voice by rule. But a
simple scaling of formants (by a factor of 1.15) and fundamental frequency (by a
factor of 1.7) does not result in a particularly good female voice quality [Klatt, 1987].
Nonuniform vowel-dependent formant scaling appears to be required, which in turn
makes the synthesis rules more complicated. The general rules for a text-to-speech
application have yet to be formulated. In addition, modification of the glottal source
model becomes necessary.

An alternative solution to the problem of producing a natural female voice
quality by a formant synthesizer might be to employ articulatory models of the
trachea, vocal folds, and vocal tract, as well as their interactions, in a sophisticated
articulatory synthesizer [Klatt, 1987]. Since articulatory synthesizers simulate the
human speech production system, the physiological differences of the vocal folds and
the vocal tract between females and males should be investigated first.
Female Vocal Folds and Tract

The  range  of  fundamental  frequencies  that  can  be  used  comfortably  is 
determined  by  the  physical  properties  of  the  glottal  structures.  The  larger  the 
vibrating  mass  of  the  vocal  folds,  the  lower  the  fundamental  frequency  [Ishizaka  and 
Flanagan,  1972].  In  general,  women  have  smaller  vocal  folds  than  men.  Whereas 
male  vocal  fold  lengths  are  more  apt  to  range  between  17  and  24  mm,  female  vocal 
fold  lengths  are  more  closely  approximated  by  13  to  17  mm  [Borden  and  Harris, 
1984]. 

The  different  formant  patterns  between  the  sexes  are  mainly  due  to  the  size  of 
the  vocal  tract.  According  to  Fant  [1973],  the  ratio  of  the  total  length  of  the  female 
vocal  tract  to  that  of  a male  vocal  tract  is  about  0.87.  The  non-linear  formant  scale 
factors  result  from  the  descent  of  the  larynx  in  males  which  occurs  during  puberty. 
Fant  [1973]  provides  some  anatomic  data  to  show  that  although  adult  males  and 
females  differ  with  respect  to  both  oral  and  pharyngeal  cavity  dimensions,  the  sexual 
distinction is largest for the pharynx. Relative to the oral tract, the female pharynx
is much shorter than the male pharynx.
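
The effect of tract length on formant frequencies can be illustrated with the textbook idealization of the vocal tract as a uniform lossless tube, closed at the glottis and open at the lips, whose resonances are F_n = (2n - 1)c / 4L. This idealization is an assumption made here for illustration and is not the synthesizer's model; the 17.5 cm male length and the sound speed of 35000 cm/s are likewise assumed round values.

```python
def uniform_tube_formants(length_cm, count=3, sound_speed_cm_s=35000.0):
    """Quarter-wave resonances (Hz) of a uniform tube closed at one end."""
    return [
        (2 * n - 1) * sound_speed_cm_s / (4.0 * length_cm)
        for n in range(1, count + 1)
    ]


MALE_LENGTH_CM = 17.5   # assumed round value, not a measurement from the text
LENGTH_RATIO = 0.87     # female/male total length ratio quoted from Fant [1973]

male = uniform_tube_formants(MALE_LENGTH_CM)
female = uniform_tube_formants(MALE_LENGTH_CM * LENGTH_RATIO)
ratios = [f / m for f, m in zip(female, male)]
print("male formants (Hz):", male)
print("female/male formant ratios:", [round(r, 3) for r in ratios])
```

In this idealization every formant rises by the same factor, 1/0.87, or about 1.15. The vowel-dependent and formant-dependent differences described earlier cannot arise from uniform length scaling alone, which is why the separate pharyngeal and oral dimensions matter.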

Differences  in  the  strength  of  the  fundamental  component  and  the  general  tilt 
of  the  harmonic  spectrum  are  related  to  the  vibratory  pattern  of  the  vocal  folds. 
During voicing, female vocal folds tend not to close completely and the shape of the
female glottal area function is more symmetrical [Kitzing and Sonesson, 1974]. The
vibratory  pattern  of  the  vocal  folds  affects  the  glottal  waveforms.  It  is  now  commonly 
agreed  that  both  the  perceptual  quality  and  the  naturalness  of  synthetic  speech  can  be 
improved  by  using  an  appropriate  source  model  during  voiced  segments  of  speech 
[Rosenberg,  1971;  Holmes,  1973;  Childers  et  al.,  1987;  Childers  and  Wu,  1989]. 

“Breathiness”  is  also  related  to  the  motion  of  the  vocal  folds.  In  comparison 
to  the  modal  voice,  the  mode  of  vibration  of  the  vocal  folds  during  breathy  voice  is 
inefficient, accompanied by a slight audible friction. Muscular effort is low, which
results in the glottis being kept somewhat open along most of its length; and thus, the
folds  never  meet  on  the  mid-line.  Because  each  closing  movement  of  the  folds  tends 
to  be  abortive,  the  lowered  glottal  resistance  leads  to  a higher  rate  of  air  flow  than  in 
the  modal  voice  [Laver,  1980].  This  higher  rate  of  air  flow  will  generate  turbulence 
at  the  glottis.  For  breathy  voice  synthesis,  a noise  generator  at  the  glottis  must  be 
added  to  the  articulatory  synthesizer. 

Noise  Generator  at  the  Glottis 

The  acoustic  analysis  of  breathy  voices  reveals  a high-frequency  noise 
component  which  originates  at  the  glottis.  This  noise  component  is  generated  by  a 
turbulent  air  flow  at  the  glottis.  According  to  physics  theory,  the  air  flow  in  a 
cylindrical  tube  is  either  laminar  or  turbulent  depending  upon  the  speed  of  the  flow 
[Daugherty  and  Ingersoll,  1954].  The  critical  condition  which  occurs  when  the  air 
flow  changes  from  laminar  to  turbulent  is  determined  by  the  Reynolds  number  (Re) 
which  is  expressed  by  the  equation  [Flanagan  and  Ishizaka,  1976] 

$$ Re = \frac{v\,h}{\nu} $$

where

h is the effective width of the stricture,
v is the velocity of air flow, and
$\nu$ is the kinematic coefficient of viscosity.

The  effective  width,  h,  is  defined  as  4 A/S,  where  A is  the  cross  sectional  area  and  S is 
the  circumference  of  the  cross  section  of  the  tube. 

If the shape of the glottis is approximated by a rectangular slit with a long side
$l_g$, then the Reynolds number is defined by
where $U_g$ is the volume velocity at the glottis.

When  the  Reynolds  number  exceeds  a certain  value,  or  the  critical  Reynolds 
number  (Rec),  the  laminar  flow  becomes  turbulent.  Thus,  during  vocal  fold 
vibration,  the  generation  of  the  glottal  turbulent  noise  is  a result  of  the  turbulent  air 
flow. 
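
For a rectangular glottal slit, the Reynolds number can be worked out from the definition of the effective width, h = 4A/S. The derivation below is a sketch under stated assumptions: the slit has length $l_g$ and width $w$ with $w \ll l_g$ (the symbol $w$ is introduced here only for illustration), the flow velocity is the volume velocity $U_g$ divided by the slit area, and $\nu$ denotes the kinematic coefficient of viscosity.

```latex
\begin{align*}
h  &= \frac{4A}{S} = \frac{4\, l_g w}{2\,(l_g + w)} \approx 2w, \\
v  &= \frac{U_g}{A} = \frac{U_g}{l_g w}, \\
Re &= \frac{v\,h}{\nu} \approx \frac{2\,U_g}{l_g\,\nu}.
\end{align*}
```

Under this approximation the slit width cancels, so the Reynolds number depends only on the volume velocity, the length of the glottis, and the viscosity.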

Isshiki  et  al.  [1978]  studied  glottal  turbulent  noise  by  using  a life-size 
laryngeal  model.  They  report  that  the  critical  Reynolds  number  for  their  laryngeal 
model  is  approximately  2,000.  Additionally,  their  experiment  reveals  that  the  sound 
pressure  of  the  noise  is  nearly  proportional  to  the  square  of  the  Reynolds  number. 
Isshiki  et  al.  [1978]  also  investigated  the  spectral  characteristics  of  the  turbulent 
noise. Their data show that the energy of the turbulent noise is distributed over a
wide range of frequencies (2-8 kHz), with some accentuation in the 4 kHz region.

For simulating aspiration, the suggestions of Fant [1960], Flanagan [1972b],
and Flanagan et al. [1975] were followed. A noise pressure source $P_{ng}$ located at the
interface between the expansion of the vocal fold and the first section of the vocal
tract was added. The pressure of $P_{ng}$ is proportional to the difference between the
squared Reynolds number $Re^2$ and the squared critical Reynolds number $Re_c^2$. Thus,

$$ P_{ng} = \begin{cases} G \cdot \mathrm{RANDOM} \cdot (Re^2 - Re_c^2), & Re > Re_c \\ 0, & Re < Re_c \end{cases} $$

where  G is  an  empirically  determined  gain  (about  2 x 10s),  RANDOM  is  a random 
number uniformly distributed between -0.5 and 0.5, and $Re_c^2$ is about $2700^2$ [Sondhi
and Schroeter, 1987].
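
The noise source rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the synthesizer's implementation: the function names are hypothetical, the Reynolds number uses the narrow-slit approximation Re = 2*Ug / (lg * nu), the viscosity of 0.15 sq. cm/s is an assumed typical value for air, and the gain of 1.0 is a placeholder rather than the empirically determined G.

```python
import random

CRITICAL_REYNOLDS = 2700.0  # Re_c, so that Re_c squared is 2700 squared


def reynolds_number(volume_velocity, slit_length, viscosity=0.15):
    """Narrow-slit approximation Re = 2*Ug / (lg * nu) (an assumed form).

    volume_velocity in cu. cm/s, slit_length in cm, viscosity in sq. cm/s.
    """
    return 2.0 * volume_velocity / (slit_length * viscosity)


def glottal_noise_pressure(re, gain, rng):
    """Noise pressure: gain * RANDOM * (Re^2 - Re_c^2) above Re_c, else 0."""
    if re <= CRITICAL_REYNOLDS:
        return 0.0
    random_value = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    return gain * random_value * (re ** 2 - CRITICAL_REYNOLDS ** 2)


rng = random.Random(0)  # seeded so the sketch is repeatable
re_low = reynolds_number(volume_velocity=100.0, slit_length=1.4)
re_high = reynolds_number(volume_velocity=500.0, slit_length=1.4)

# Placeholder gain of 1.0; the text's G is an empirically determined constant.
quiet = glottal_noise_pressure(re_low, gain=1.0, rng=rng)
noisy = glottal_noise_pressure(re_high, gain=1.0, rng=rng)
```

In a time-domain synthesizer this function would be evaluated once per sample, so the noise switches on only during the portion of each glottal cycle in which the flow is large enough to exceed the critical Reynolds number.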
Key Parameters for Producing Female Voice

Voice  conversion  was  selected  as  the  synthesis  task  to  aid  in  the  identification 
of  the  key  control  parameters  for  producing  female  voices.  The  flexible  articulatory 
speech  synthesizer  is  capable  of  independently  manipulating  not  only  the  vocal  tract 
system,  but  the  pitch  contour  and  the  glottal  waveshape  as  well.  Use  of  the  proposed 
synthesizer  for  studying  various  aspects  of  the  speech  signal,  such  as  the  importance 
of  different  parameters  or  features  and  their  effect  on  quality  of  female  voice,  is 
relatively  convenient.  Sentences  were  first  synthesized  according  to  a male  voice, 
then  speech  samples  of  the  female  voice  were  synthesized  by  systematically  varying 
the  control  parameters.  Finally,  listening  tests  were  conducted  to  evaluate  the 
perceptual  correlations  of  the  control  parameters  and  to  provide  useful  information 
regarding  those  control  parameters  necessary  to  achieve  female  voice  characteristics. 

In the experiments conducted for this research effort, speech and EGG data
were collected from an experienced male speech pathologist for the utterances of two
sentences, “Good-bye Bob.” and “We were away a year ago.” The data collection
procedures and the control parameter assignments are similar to those used in
Chapter 4.


Experiment  I 

The  purpose  of  this  experiment  was  to  correlate  the  quality  of  the  synthetic 
female  voice  to  the  control  parameters  of  the  proposed  articulatory  synthesizer.  The 
stimuli  for  this  listening  test  were  synthetic  voices  produced  by  the  articulatory 
synthesizer.  The  control  parameters  for  producing  these  stimuli  were  scaled  from 
control  parameters  producing  male  voices.  For  the  control  parameters  under 
investigation,  synthetic  speech  was  produced  by  changing  the  appropriate  control 
parameters  while  not  varying  the  others. 
Four groups of control parameters were investigated in this experiment.
These groups and their scaling factors were: (1) the fundamental frequency,
multiplied by a factor of 1.9; (2) the vocal tract area function, with the length of the
pharyngeal part scaled by a factor of 0.8, the length of the oral part by a factor of 0.9,
the area of the pharyngeal part by a factor of 0.64, and the area of the oral part by a
factor of 0.81 (Figure 5-1); (3) the vocal fold parameters, with the length and thickness
of the vocal folds scaled by a factor of 0.8 and a different waveform of the glottal
area function (Figure 5-2); and (4) the presence or absence of noise generation at the
glottis. After successfully synthesizing the male voice, the control parameters were
systematically changed to generate female voices: the parameters were scaled
according to the average parameters of females, and no attempt was made to match a
specific target female voice. The parameter groups were changed individually at
first, then in every possible combination, which yielded fifteen "female" voices
from the one male voice.
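
The count of fifteen converted voices follows from the four two-valued parameter groups: there are 2^4 = 16 settings, one of which is the unmodified male voice. A short sketch of the enumeration is given below; the group names are labels chosen here, and the glottal noise setting is coded as M (off) or F (on) for uniformity.

```python
from itertools import product

PARAMETER_GROUPS = ("pitch", "tract", "glottal_waveform", "glottal_noise")


def enumerate_settings():
    """All male (M) / female (F) assignments of the four parameter groups."""
    return [
        dict(zip(PARAMETER_GROUPS, choice))
        for choice in product("MF", repeat=len(PARAMETER_GROUPS))
    ]


settings = enumerate_settings()
# Every setting with at least one group switched to the female value.
converted = [s for s in settings if "F" in s.values()]
print(len(settings), "settings in total,", len(converted), "converted voices")
```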

During  the  listening  test,  the  judges  were  asked  to  decide  whether  the 
synthetic  sample  sounded  like  a female  voice  or  male  voice  and  to  rate  the 
naturalness  of  the  voice  on  a 1 to  7 scale.  A rating  of  7 represents  a very  natural 
quality  and  a rating  of  1 represents  an  unnatural  quality. 

Table 5-1 presents the perceptual ratings of the three judges along with the
key control parameters. Sample 1-1 is the original synthetic male voice. Sample 1-2
represents the first "female" voice, synthesized by changing only the fundamental
frequency of voicing. Two of the judges thought the synthesized voice sounded
female-like; the other judge considered the voice male. Note that no average rating
was computed for samples on which the judges disagreed about the gender of the
voice. The remaining entries in Table 5-1 represent the results for the other
Figure 5-1.  Scale factors, plotted against section number.
(a) Scale factor for vocal tract length.
(b) Scale factor for vocal tract area.
Figure 5-2.  The glottal area functions.
upper:  for male voice, Agamp=0.2 sq.cm, Ag0=0 sq.cm, Qo=0.6, Qs=1.5.
bottom: for female voice, Agamp=0.15 sq.cm, Ag0=0.02 sq.cm, Qo=1, Qs=1.
Table 5-1.  The control parameters and naturalness ratings for the test
sentence samples.

          Control Parameters                       Naturalness Ratings
Sample    Pitch   Tract   Glottal    Glottal       J1    J2    J3    Avg.
No.               Area    Waveform   Noise

1-1       M       M       M          no            M2    M6    M6    M4.7
1-2       F       M       M          no            F1    F3    M3    -
1-3       M       F       M          no            M2    M6    M5    M4.3
1-4       M       M       F          no            M3    M5    M6    M4.7
1-5       M       M       M          yes           M3    M6    M5    M4.7
1-6       F       F       M          no            F3    F4    F5    F4.0
1-7       F       M       F          no            F2    F5    M3    -
1-8       F       M       M          yes           F3    F3    M2    -
1-9       M       F       F          no            M2    M3    M4    M3.0
1-10      M       F       M          yes           M2    M5    M5    M4.0
1-11      M       M       F          yes           M3    M2    M7    M4.0
1-12      F       F       F          no            F2    F5    F6    F4.3
1-13      F       F       M          yes           F1    F3    F2    F2.0
1-14      F       M       F          yes           F2    F4    M4    -
1-15      M       F       F          yes           M3    M2    M7    M4.0
1-16      F       F       F          yes           F2    F5    F3    F3.3

* Naturalness Ratings:
  M means that the judge perceived the speech sample as male voice.
  F means that the judge perceived the speech sample as female voice.
  The number is the naturalness rating.
  The naturalness rating scale is from 1 to 7. 7 represents a very natural
  quality and 1 represents an unnatural quality.
* Control Parameters:
  M means male voice or using male control parameters.
  F means female voice or using female control parameters.
* Ji represents judge i, i = 1, 2, 3.
* The sentence is "We were away a year ago."
Table 5-1.  (continued)

          Control Parameters                       Naturalness Ratings
Sample    Pitch   Tract   Glottal    Glottal       J1    J2    J3    Avg.
No.               Area    Waveform   Noise

2-1       M       M       M          no            M3    M3    M3    M3.0
2-2       F       M       M          no            M3    F1    M2    -
2-3       M       F       M          no            M3    M3    M3    M3.0
2-4       M       M       F          no            M3    M4    M6    M4.3
2-5       M       M       M          yes           M3    M5    M5    M4.3
2-6       F       F       M          no            M3    F4    M2    -
2-7       F       M       F          no            M3    F5    M2    -
2-8       F       M       M          yes           M2    M4    M2    M2.7
2-9       M       F       F          no            M2    M3    M5    M3.3
2-10      M       F       M          yes           M3    M3    M4    M3.3
2-11      M       M       F          yes           M3    M3    M4    M3.3
2-12      F       F       F          no            F2    F4    F2    F2.7
2-13      F       F       M          yes           M3    F5    F2    -
2-14      F       M       F          yes           M2    F2    M1    -
2-15      M       F       F          yes           M3    M3    M4    M3.3
2-16      F       F       F          yes           F2    F4    F1    F2.3

* Naturalness Ratings:
  M means that the judge perceived the speech sample as male voice.
  F means that the judge perceived the speech sample as female voice.
  The number is the naturalness rating.
  The naturalness rating scale is from 1 to 7. 7 represents a very natural
  quality and 1 represents an unnatural quality.
* Control Parameters:
  M means male voice or using male control parameters.
  F means female voice or using female control parameters.
* Ji represents judge i, i = 1, 2, 3.
* The sentence is "Good-bye Bob."




"female" voices. Although each judge rated the samples differently, the results
suggest that: (1) if only one control parameter can be changed, then the pitch contour
is the key parameter for producing the female voice; (2) if two control parameters
can be changed, then the pitch contour and the vocal tract area function become the
key control parameters for producing the female voice; (3) in order to synthesize a
female voice, all three control parameters (pitch contour, vocal tract shape and the
pattern of movement of the vocal folds) should have correct values; and (4) in order
to convert a male voice to a female voice, the rules for scaling the control parameters
may be sensitive to sentence context. The results of this experiment supported the
notion that fundamental frequency (pitch) was more important than the
formant patterns for judging the degree of maleness or femaleness. Earlier
experiments [Coleman, 1976; Lass et al., 1976] sought to decide which
acoustic characteristic (pitch or formant pattern) is chiefly responsible for the
perception of a talker's gender, but the results were controversial. The methods used
by previous researchers could not change one feature while keeping all other
features the same. Since the articulatory synthesizer can control each feature
precisely, the results of this experiment appear to be more reliable.
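
Conclusion (1) can be traced to the rows of Table 5-1 in which exactly one parameter group was switched to the female setting. The tally below is an illustrative sketch; the dictionary layout is chosen here, and each string lists the gender verdicts of judges 1 to 3 as transcribed from the table.

```python
# Gender verdicts (J1, J2, J3) for the single-change samples of Table 5-1.
single_change_votes = {
    "pitch": {"1-2": "FFM", "2-2": "MFM"},
    "tract": {"1-3": "MMM", "2-3": "MMM"},
    "glottal_waveform": {"1-4": "MMM", "2-4": "MMM"},
    "glottal_noise": {"1-5": "MMM", "2-5": "MMM"},
}


def count_female_votes(votes_by_sample):
    """Total number of 'female' verdicts across the listed samples."""
    return sum(verdicts.count("F") for verdicts in votes_by_sample.values())


female_votes = {
    group: count_female_votes(samples)
    for group, samples in single_change_votes.items()
}
for group, count in female_votes.items():
    print(f"{group}: {count} of 6 verdicts were 'female'")
```

Only the pitch change drew any "female" verdicts when applied alone, and even then the judges did not agree, which fits the observation that a convincing female voice needed several parameter groups changed together.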


Experiment  II 

The  objective  of  this  experiment  was  to  study  the  two  control  parameters 
necessary  for  the  synthetic  production  of  breathy  voices:  (1)  the  waveform  of  the 
glottal  area  function,  and  (2)  the  generation  of  noise  at  the  glottis.  These  control 
parameters  were  changed,  individually  and  in  combinations,  to  produce  four  samples 
(including  the  original)  of  each  sentence,  but  the  other  control  parameters  were  kept 
the  same  (for  female  voice).  Figure  5-3  shows  the  waveforms  of  the  glottal  volume 
velocity  in  four  different  situations,  and  Figure  5-4  shows  corresponding  spectra. 
Figure 5-3.  The waveforms of the glottal volume velocity.
upper:  Agamp=0.2 sq.cm, Ag0=0 sq.cm, Qo=0.6, Qs=1.5, without noise generator.
bottom: Agamp=0.2 sq.cm, Ag0=0 sq.cm, Qo=0.6, Qs=1.5, with noise generator.

Figure 5-3.  (continued)
upper:  Agamp=0.15 sq.cm, Ag0=0.02 sq.cm, Qo=1, Qs=1, without noise generator.
bottom: Agamp=0.15 sq.cm, Ag0=0.02 sq.cm, Qo=1, Qs=1, with noise generator.
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Figure  5-4.  The  spectra  corresponding  to  figure  5-3. 

upper:  Agamp=0.2  sq.cm,  Ag0=0  sq.cm,  Qo=0.6,  Qs=1.5,  without  noise 
generator. 

bottom:  Agamp=0.2  sq.cm,  Ag0=0  sq.cm,  Qo=0.6,  Qs=1.5,  with  noise 
generator. 
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Figure  5-4.  (continued) 

upper:  Agamp=0.15  sq.cm,  AgO=0.02  sq.cm,  Q0=l,  Qs=l,  without  noise 
generator. 

bottom:  Agamp=0.15  sq.cm,  AgO=0.02  sq.cm,  Q0=l,  Qs=l,  with  noise 
generator. 
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The  spectrum  showed  that  the  one  with  Q0=l  and  Qs=l  had  steeper  general  tilt,  which 
is  a property  of  breathy  voice. 

During  the  listening  test,  the  judges  were  asked  to  rate  the  degree  of 
breathiness  on  a scale  of  1 to  7.  A rating  of  7 represents  a high  degree  of  breathiness, 
while  a rating  of  1 represents  the  absence  of  breathiness. 

Table 5-2 lists the perceptual ratings along with the key source parameters.
No significant differences in the ratings were found among the three judges. The
results suggest that adding a noise generator at the glottis produces a slightly breathy
voice; that using a sinewave-like glottal area function alone produces a somewhat
higher degree of breathiness; and that using both a sinewave-like glottal area function
and a noise generator results in the highest degree of breathiness. Since the
sinewave-like glottal area function generates weaker high-frequency components than
the other waveforms do, the high-frequency components in the synthetic breathy voice
have less intensity than those in the synthetic modal voice. As a result, the
noise-to-signal ratio at the higher end of the frequency range (above 2 kHz) is larger,
and the voice is perceived as breathy.

Since the noise generator at the glottis in this synthesizer is controlled by the
Reynolds number, the glottal noise is amplitude-modulated (see the section on the
glottal noise generator in this chapter). Lee's [1988] experiments on synthetic
breathy voice show that amplitude modulation of the noise source has a
significant effect on perceptual naturalness, whereas spectral shaping of the noise
source by high-pass filtering is less perceptible. Lee also claims that although the
location (within a pitch period) of noise production is not very critical, perceptual
naturalness improves when the noise source is located near the point of maximum
glottal closure. In this experiment, however, the noise was generated near the peak of
the glottal volume velocity, which agrees with Stevens' [1971] results. For normal
voicing, Stevens concludes that some turbulent noise might be generated at the glottis
during the instants of peak volume flow, but that the duration of the noise source
during each glottal pulse is brief. For breathy voicing, the instantaneous airflow is
higher and the intensity of the turbulence noise is larger.

Table 5-2. The source parameters and breathiness ratings for the two test
sentences.

Sample   Source Parameters       Breathiness Ratings
No.      Sine-like    Noise      J1    J2    J3    Avg.
         Waveform     Source
1-1      no           no         2     1     2     1.7
1-2      no           yes        4     2     3     3.0
1-3      yes          no         5     3     3     3.7
1-4      yes          yes        6     5     5     5.3
2-1      no           no         2     2     2     2.0
2-2      no           yes        4     2     3     3.0
2-3      yes          no         5     4     5     4.7
2-4      yes          yes        6     5     6     5.7

* The scale of breathiness ratings is from 1 to 7. A rating of 7 represents a high
degree of breathiness and 1 represents the absence of breathiness.

* Ji represents judge i, i = 1, 2, 3.

* Samples 1-* are samples of the sentence "We were away a year ago."
* Samples 2-* are samples of the sentence "Good-bye Bob."
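
As a check on the arithmetic behind Table 5-2, the short sketch below recomputes the per-sample averages and pools the ratings by source condition. The ratings are copied from the table; the pooling across both sentences and all three judges is an illustrative choice made here, not an analysis reported in the text.

```python
# Judge ratings (J1, J2, J3) copied from Table 5-2, keyed by sample number.
# Each entry also records whether the sine-like waveform and the noise source were used.
ratings = {
    "1-1": (False, False, (2, 1, 2)),
    "1-2": (False, True, (4, 2, 3)),
    "1-3": (True, False, (5, 3, 3)),
    "1-4": (True, True, (6, 5, 5)),
    "2-1": (False, False, (2, 2, 2)),
    "2-2": (False, True, (4, 2, 3)),
    "2-3": (True, False, (5, 4, 5)),
    "2-4": (True, True, (6, 5, 6)),
}


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    values = list(values)
    return sum(values) / len(values)


# Per-sample averages, rounded to one decimal as in the "Avg." column.
averages = {sample: round(mean(r), 1) for sample, (_, _, r) in ratings.items()}


def condition_mean(sine, noise):
    """Mean rating pooled over both sentences and all judges (illustrative only)."""
    pooled = [x for s, n, r in ratings.values() if (s, n) == (sine, noise) for x in r]
    return mean(pooled)


if __name__ == "__main__":
    for sample, avg in averages.items():
        print(sample, avg)
    for sine in (False, True):
        for noise in (False, True):
            print("sine-like:", sine, "noise:", noise, round(condition_mean(sine, noise), 2))
```

The pooled means increase in the order baseline, noise only, sine-like waveform only, both, which is the ordering described in the discussion of the listening test.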


Experiment III

The purpose of this experiment was to study three key control parameters,
namely (1) pitch, (2) the waveform of the glottal area function, and (3) the vocal tract
area function (especially the vocal tract length). These control parameters were
changed individually to synthesize samples of the sentence "We were away a year
ago." Informal A-B listening tests were conducted to relate the quality of the
synthetic female voice to the value of each control parameter.

First, three different pitch contours were used to produce three samples. The
average pitches of these contours were 175 Hz, 190 Hz, and 250 Hz, respectively. The
judges reported that (1) the sample whose average pitch was 250 Hz sounded like a
child's voice, and (2) the other two samples were good female voices, but the one with
the higher pitch sounded more "lively".

Second, three different waveforms of the glottal area function were used to
produce three samples. The open quotient Q0, the speed quotient Qs, and the
minimum glottal area Ag0 of these waveforms were, respectively: (1) Q0 = 0.6,
Qs = 1.5, Ag0 = 0 sq.cm; (2) Q0 = 0.8, Qs = 1, Ag0 = 0 sq.cm; (3) Q0 = 1, Qs = 1,
Ag0 = 0.02 sq.cm. The judges preferred the samples produced by using waveforms
with a larger Q0 and Qs = 1.
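
To make the roles of Q0, Qs, Ag0 and Agamp concrete, the following is a minimal sketch of a parametric glottal area pulse. It uses the usual definitions (open quotient = open time / pitch period, speed quotient = opening time / closing time), but the half-cosine opening and closing segments are an assumption chosen for illustration; the exact waveform used by the synthesizer is not given in this section. The function name `glottal_area` is hypothetical.

```python
import math


def glottal_area(t, T, ag0, agamp, q0, qs):
    """Glottal area (sq.cm) at time t for pitch period T.

    Assumed shape: half-cosine rise during the opening phase, half-cosine fall
    during the closing phase, and the minimum area ag0 for the rest of the period.
    """
    t_open = q0 * T                    # total open time within one period
    t_rise = t_open * qs / (1.0 + qs)  # opening phase duration
    t_fall = t_open - t_rise           # closing phase duration
    tau = t % T                        # position within the current period
    if tau < t_rise:
        shape = 0.5 * (1.0 - math.cos(math.pi * tau / t_rise))
    elif tau < t_open:
        shape = 0.5 * (1.0 + math.cos(math.pi * (tau - t_rise) / t_fall))
    else:
        shape = 0.0
    return ag0 + agamp * shape


if __name__ == "__main__":
    T = 0.005  # a 200 Hz pitch period, chosen only for the demonstration
    settings = [(0.6, 1.5, 0.0, 0.2), (0.8, 1.0, 0.0, 0.2), (1.0, 1.0, 0.02, 0.15)]
    for q0, qs, ag0, agamp in settings:
        samples = [glottal_area(k * T / 10, T, ag0, agamp, q0, qs) for k in range(10)]
        print(q0, qs, [round(a, 3) for a in samples])
```

Under this assumed shape, the setting Q0 = 1, Qs = 1 reduces to a raised cosine over the whole period, which is consistent with the text calling that waveform sinewave-like, and Ag0 > 0 keeps the glottis from closing completely.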

Third, three different vocal-tract scale factors were used to produce three
samples. The scale factors for the length of the pharyngeal part and the oral part
were, respectively: (1) 1.0, 1.0; (2) 0.8, 0.95; and (3) 0.77, 0.85. For a typical vocal
tract length of 17.5 cm, the scaled vocal tract lengths were 17.5 cm, 15.3 cm, and 14.2
cm, respectively. The judges preferred the shorter vocal tracts and reported that there
was little perceptual difference between the two short ones.
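
The scaled lengths quoted above can be reproduced with a few lines of arithmetic. The split of the 17.5 cm tract into pharyngeal and oral parts is not stated in this section; the values of 8.8 cm and 8.7 cm used below are an assumption, back-computed here so that the three reported totals come out right.

```python
# Assumed split of the 17.5 cm vocal tract (not given in the text).
PHARYNX_CM = 8.8
ORAL_CM = 8.7


def scaled_length(pharynx_scale, oral_scale):
    """Total vocal tract length (cm) after scaling each part separately."""
    return pharynx_scale * PHARYNX_CM + oral_scale * ORAL_CM


if __name__ == "__main__":
    for scales in [(1.0, 1.0), (0.8, 0.95), (0.77, 0.85)]:
        print(scales, round(scaled_length(*scales), 1), "cm")
```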

Fourth, since the original synthetic voices sounded a little "tinny",
low-pass filtered samples were compared with the original. To avoid large
distortions in the speech waveform, linear-phase FIR filters were used. Two filters
were tried: both had no loss in the range of 0 - 3 kHz, and they had a loss of 6 dB and
12 dB, respectively, in the range of 4 - 5 kHz. The judges preferred the low-pass
filtered samples. The "tinny" sound in the original sample was produced by the
sudden closure of the glottal area function. The use of a rounded corner in the glottal
area waveform at glottal closure is recommended for synthesizing female voices.
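
A filter with the stated behavior can be sketched as follows. This is not the filter used in the experiment: the sampling rate (10 kHz), the filter length (101 taps), the linear transition between 3 and 4 kHz, and the frequency-sampling design with a Hamming window are all assumptions made for illustration. The only property taken from the text is the target response (no loss up to 3 kHz, 6 or 12 dB loss from 4 to 5 kHz) and the linear phase, which follows from the symmetric coefficients.

```python
import math

FS = 10000.0   # assumed sampling rate in Hz (not stated in the text)
NUMTAPS = 101  # assumed odd filter length, so the delay is a whole number of samples
GRID = 1024    # number of frequency samples used in the design


def desired_gain(f, loss_db):
    """Target magnitude: 1 up to 3 kHz, linear ramp to the loss at 4 kHz, flat above."""
    floor = 10.0 ** (-loss_db / 20.0)
    if f <= 3000.0:
        return 1.0
    if f >= 4000.0:
        return floor
    return 1.0 + (floor - 1.0) * (f - 3000.0) / 1000.0


def design_lowpass(loss_db):
    """Frequency-sampling design followed by a Hamming window (symmetric taps)."""
    mid = (NUMTAPS - 1) // 2
    # Sample the target on the full circle; frequencies above FS/2 mirror those below.
    target = [desired_gain(min(k, GRID - k) * FS / GRID, loss_db) for k in range(GRID)]
    taps = []
    for n in range(NUMTAPS):
        acc = sum(target[k] * math.cos(2.0 * math.pi * k * (n - mid) / GRID)
                  for k in range(GRID))
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (NUMTAPS - 1))
        taps.append(window * acc / GRID)
    return taps


def gain_db(taps, f):
    """Magnitude response of the FIR filter at frequency f, in dB."""
    re = sum(h * math.cos(2.0 * math.pi * f * n / FS) for n, h in enumerate(taps))
    im = sum(h * math.sin(2.0 * math.pi * f * n / FS) for n, h in enumerate(taps))
    return 20.0 * math.log10(math.hypot(re, im))


taps_6db = design_lowpass(6.0)
taps_12db = design_lowpass(12.0)

if __name__ == "__main__":
    for f in (1000.0, 3000.0, 4500.0):
        print(f, round(gain_db(taps_6db, f), 2), round(gain_db(taps_12db, f), 2))
```

Because the coefficients are symmetric about the center tap, every frequency is delayed by the same 50 samples, so the filter changes the spectral balance without smearing the waveform in time.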


Summary 

Based on the results of these experiments, the conclusions regarding the
control parameters needed to achieve female voice characteristics are as follows:

(1) Pitch contour is the most important parameter for synthesizing female
voices. The data from the experiments suggest that an average pitch of about 190 Hz
is preferred.

(2) The vocal tract area function, especially the vocal tract length, is a key
control parameter for producing female voices. A vocal tract length in the range
from 14.2 cm to 15.3 cm is appropriate for a female voice.

(3) The waveshape of the glottal area function is another key control
parameter for achieving female voice characteristics. An open quotient larger than
0.8 and a speed quotient near 1 are preferred. To generate a spectral tilt of more
than 12 dB per octave, which is important for female voices, a rounded corner at
closure should be added to the glottal area function.

(4) The turbulent noise at the glottis is a key parameter for producing breathy
voice.

Voice conversion was a useful method for identifying the key control
parameters for synthesizing female voices. These parameters were identified by
synthesizing a prototype male voice and then systematically changing the selected
control parameters in a prescribed manner to re-synthesize female voices. The
listening tests determined how well a change in a particular parameter converted the
original male voice into a female voice.


CHAPTER  6 

CONCLUSIONS  AND  DISCUSSIONS 


Summary 

This research investigated several aspects of articulatory speech synthesis.
First, a stable, flexible and computationally efficient articulatory speech synthesizer
was developed. Second, nasal sounds and female voices were synthesized using this
articulatory synthesizer. Finally, perceptual evaluations were conducted to correlate
the control parameters of the synthesizer with the quality of the resulting synthetic
speech. This chapter provides a synoptic discussion of these accomplishments as
well as a sense of direction for future research efforts in this area. The achievements
of this research are as follows.

A new articulatory synthesizer. A stable, flexible and computationally
efficient articulatory speech synthesizer was developed. A major factor to be
considered in implementing an articulatory synthesizer is the stability of the
algorithm used to solve the acoustic equations of the vocal system. Because the vocal
system is non-linear and time-varying, the stability of the algorithm
becomes even more critical. As discussed in Chapter 2, the acoustic equations of the
vocal system were shown to be essentially stiff differential equations; thus, the
algorithm used to solve them must be suitable for stiff equations. The trapezoidal
method was selected because it is stable for any step size, provided the eigenvalues of
the differential equations have negative real parts.
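
The stability property can be illustrated on the scalar test equation y' = lam * y with a large negative lam, which stands in for one fast-decaying mode of a stiff system. This is only a sketch: the values of lam and the step size are illustrative assumptions, and the synthesizer's actual equations are the circuit equations of Chapter 2, not this test problem. Forward Euler is included purely for contrast.

```python
def forward_euler(lam, h, steps, y0=1.0):
    """Explicit Euler for y' = lam * y: y[n+1] = (1 + h*lam) * y[n]."""
    y = y0
    for _ in range(steps):
        y = y + h * lam * y
    return y


def trapezoidal(lam, h, steps, y0=1.0):
    """Trapezoidal rule for y' = lam * y.

    y[n+1] = y[n] + (h/2) * (lam*y[n] + lam*y[n+1]), solved for y[n+1].
    """
    growth = (1.0 + 0.5 * h * lam) / (1.0 - 0.5 * h * lam)
    y = y0
    for _ in range(steps):
        y = growth * y
    return y


if __name__ == "__main__":
    lam = -1.0e5  # fast mode (illustrative value)
    h = 1.0e-4    # step size much larger than 1/|lam| (illustrative value)
    print("forward Euler:", forward_euler(lam, h, 20))  # grows without bound
    print("trapezoidal:  ", trapezoidal(lam, h, 20))    # decays toward zero
```

With h * lam = -10, the Euler update multiplies the solution by -9 at every step, while the trapezoidal update multiplies it by -2/3, so only the latter stays bounded at this step size.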

Computational efficiency is also a feature of this articulatory synthesizer.
Since there were no reliable methods to derive control parameters, many trials were
needed in order to produce a satisfactory synthetic speech sound. In addition, because
of the heavy computational load involved in articulatory synthesis, a long
computation time is required to produce synthetic speech. For example, Bocchieri
[1983] reports a computation time of 5 hours for 1 second of speech on a Data
General Eclipse S/130. These factors make an efficient algorithm a key requirement
of articulatory synthesizers. The acoustic equations were simplified by using
associated discrete circuit models and circuit analysis theory. Directly simplifying the
acoustic differential equations involves hundreds of variables and equations, so it is
not an easy task. The associated discrete circuit models make it possible to apply
powerful circuit analysis methods to the simplification of the acoustic equations.
About 4 minutes of computation time is now required to synthesize 1 second of
speech on a VAX-11/750 computer. Compared with Bocchieri's result, this is roughly
a 75-fold reduction in computation time, although the two figures were obtained on
different machines.

Flexibility  is  another  feature  of  the  proposed  synthesizer.  Although  a 
distributed  element  transmission  line  representation  is  widely  used  for  modeling  the 
vocal  tract,  the  choice  of  component  values  is  not  consistent  among  users  [Wakita 
and  Fant,  1978].  Besides,  there  are  variations  among  talkers  in  cavity  wall 
impedance,  glottal  and  subglottal  impedance,  and  the  nasal  cavity  system.  More 
investigations  are  needed  to  establish  reliable  rules  for  correctly  choosing  these 
component values. As in any reliable research investigation, a means of changing
the control parameters must be readily available. With the proposed
articulatory synthesizer, the parameters corresponding to the vocal folds, the wall of
the vocal tract, the shape of the nasal tract, and noise generation can be easily
specified by the user.

Major  control  parameters  for  nasals.  The  quality  of  the  synthetic  nasal  sounds 
produced  by  the  articulatory  synthesizer  is  good.  The  experiments  show  that  nasality 
depends  primarily  on  the  velopharyngeal  opening  and  that  the  maxillary  sinuses 
have, at best, a minimal effect on the quality of nasal sounds. The amount of
velopharyngeal opening for nasals as determined by the experiment is in agreement
with  the  data  found  in  the  literature.  This  indicates  that  the  articulatory  synthesizer 
effectively  simulates  the  human  vocal  system. 

Major  control  parameters  for  female  voice.  Although  pitch  is  the  main 
feature  used  to  distinguish  the  gender  of  talkers,  the  articulatory  synthesizer  cannot 
produce  natural-sounding  female  voices  by  simply  changing  the  pitch  contour. 
According  to  experiments  as  detailed  in  Chapter  5,  in  order  to  produce  female  voice 
quality,  all  of  the  following  control  parameters  need  to  be  represented  with  correct 
values:  (1)  the  pitch  contour,  (2)  the  vocal  tract  shape,  and  (3)  the  pattern  of 
movement of the vocal folds. It is important to add a rounded corner at closure to the
glottal area function when synthesizing female voices, whose spectral tilt is more than
12 dB per octave.

Breathiness  is  a common  feature  of  the  female  voice.  This  research  suggests 
that  using  a sinewave-like  glottal  area  function  with  incomplete  closure  is  important 
for  generating  breathy  voices.  This  is  because  the  incomplete  closure  of  the  vocal 
folds  produces  more  sinewave-like  glottal  volume  velocity  and  generates  more  noise 
at  the  glottis.  Since  the  sinewave-like  glottal  volume  velocity  produces  speech 
waveforms  whose  high-frequency  components  have  less  energy,  the  noise  in  the  high 
frequency  range  is  more  noticeable.  Therefore,  the  synthetic  speech  is  perceived  as 
breathy  voice. 


Directions  of  Future  Research 

Articulatory  speech  synthesizers  have  the  potential  to  produce  high-quality 
synthetic  speech.  The  results  of  this  research  are  very  encouraging,  but  much 
remains to be investigated. Further research is suggested in several areas.


Deriving Control Parameters

In  order  to  successfully  synthesize  speech,  it  is  necessary  to  supply  the 
articulatory  synthesizer  with  a sequence  of  control  signals  which  are  appropriate  to 
the  details  of  the  required  utterance.  The  intelligibility  and  naturalness  of  the 
resultant  speech  will  depend  primarily  on  the  quality  of  the  control  parameters. 
Determining  the  control  parameters  for  producing  intelligible  and  natural-sounding 
speech  is  a very  difficult  task;  however,  it  is  an  important  area  for  future  research. 

Vocal  tract  parameters.  The  main  reason  for  the  rather  slow-paced  advance 
in  developing  high-quality  articulatory  synthesizers  is  the  lack  of  reliable 
physiological data, especially with respect to the female vocal and nasal tracts. Very
little original data on the area function of the vocal tract have accumulated, because
of the hesitancy to expose subjects to X-ray radiation. Although several
techniques have been proposed to derive area functions from the speech waveform
[Wakita, 1979; Levinson and Schmidt, 1983], they have not yet provided new
reference material. A reliable, non-invasive, simple method to obtain the vocal tract
area function is needed.

Vocal  fold  parameters.  It  has  long  been  suspected  that  the  quality  of  synthetic 
speech  may  be  improved  at  the  acoustic  level  through  an  appropriate  model  for 
sources during voiced segments of speech [Childers and Wu, 1989]. It is also known
that  the  pattern  and  periodicity  of  the  vocal  fold  movements  are  subject  to  large 
variations  among  different  utterances  and  different  talkers  [Holmes,  1973].  Holmes 
felt  that  the  failure  to  simulate  such  detail  of  the  glottal  pulse  could  be  a significant 
factor  for  explaining  the  unnatural  quality  of  synthetic  speech.  Therefore,  a reliable, 
non-invasive, simple method to obtain the glottal area function is needed.


Modification of the Acoustic Model

Although  the  control  parameters  are  the  major  factors  for  high  quality 
synthetic  voice,  the  acoustic  model  of  the  vocal  system  also  needs  to  be  improved. 
The  consonant  models  are  still  rather  primitive.  The  difficulty  involved  in  modeling 
all  relevant  factors  in  the  acoustic  production  process  has  not  been  overcome  as  yet 
[Fant,  1980]. 

First,  the  subglottal  system  should  be  included  in  the  acoustic  model  of  the 
vocal  system.  Although,  in  normal  voice  production,  the  influence  of  the  subglottal 
system  appears  to  be  small,  it  should  be  considered  if  there  is  constant  leakage 
bypassing the vibrating part of the glottis. For example, in a breathy voice, the effect
of the subglottal system on the vowel spectrum is to create a spurious peak at about
2150 Hz, and to modify harmonic amplitudes in the F1 to F2 region [Klatt, 1987].

Second, the noise generator for consonant sounds must be improved. In
natural speech, unvoiced sounds are produced either by aspiration or by frication.
The generation of turbulence noise is highly complex [Maeda, 1982a]. A random
number generator is a practical solution for the production of certain classes of
consonants, but it cannot be considered a simulation of the turbulence created by
airflow passing through a narrow constriction. A new model of turbulence noise is
needed for improving the quality of synthetic consonants and breathy voices.

Third,  correctly  modeling  the  nasal  tract  is  important  for  synthesizing  nasal 
sounds. Thus, a natural continuation of the research reported here is to study how to
improve the current nasal tract model. The correct parameters for the nasal tract
may be found by comparing the energy and spectra of the volume velocity at the
nostrils and the mouth, as simulated by this articulatory synthesizer, with real data
measured from human subjects.


APPENDIX 

STEPS  FOR  ARTICULATORY  SYNTHESIS 


For  an  articulatory  speech  synthesizer  there  is  no  analysis  method  available 
to  obtain  the  control  parameters.  But  the  natural  speech  spectrogram  is  quite  useful 
in  obtaining  a good  estimate  of  the  required  duration  of  each  phoneme.  In  addition, 
the  pitch  and  the  intensity  contours  provide  the  information  to  assign  pitch  period 
T and  subglottal  pressure  Ps.  Although  a human  speech  sample  was  used  to  derive 
some  control  parameters,  no  attempt  was  made  to  match  the  human  speech,  since 
there  is  no  theory  that  tells  us  how  to  manipulate  the  control  parameters  to  achieve 
the  desired  sound  synthesized  by  using  an  articulatory  synthesizer. 

A block diagram of the steps used in our articulatory synthesis is shown in
Figure A-1. Explanations of these steps are as follows.

(1)  Analyze  the  EGG  signal  to  obtain  the  pitch  contour.  Approximate  the 
pitch  contour  by  using  line  segments.  This  step  assigns  the  pitch  period  T. 

(2)  Analyze  the  speech  signal  to  obtain  the  intensity  contour.  The  intensity 
helps  us  to  assign  the  subglottal  pressure  Ps. 

(3) Create a glottal file based on steps (1) and (2). In this file every record
has seven fields: target time, t; pitch period, T; minimum glottal area,
Ag0; amplitude of the glottal area function, Agamp; open quotient, Q0; speed
quotient, Qs; and subglottal pressure, Ps. A sample file listing appears in Table A-1.

(4) Assign phoneme durations based on the spectrogram and the context of
the speech. Choose an area function for every phoneme from the database of
area functions. For this research we use the area functions from data collected by
Fant [1960].

(5) Create a tract file based on step (4). In this file each record has four fields:
target time, t; filename of the target vocal tract area function; the section
number of the velum, N; and the opening of the velopharyngeal port, Av. A sample
file is shown in Table A-2.

(6) Synthesize the speech by using the glottal file and the tract file.

(7) Play back. If a particular phoneme does not sound as expected, choose
another area function for that phoneme. The area function of a phoneme is often
affected by the next phoneme, especially for consonants. If nasality is a problem,
adjust the opening of the velopharyngeal port, Av. If there is a problem with the
loudness, adjust the subglottal pressure, Ps, or change the amplitude of the glottal
area function, Agamp. If voice quality is a problem, change the open quotient, Q0;
the speed quotient, Qs; and the minimum glottal area, Ag0.

(8) Re-synthesize the speech to assess whether the changes made were
appropriate.

Figure A-1. Steps in articulatory synthesis. (Block diagram; inputs: EGG and
SPEECH signals.)
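
The seven-field glottal file records described in the steps above can be represented and read as in the sketch below. The rows are the ones listed in Table A-1. The whitespace-separated layout, the class and function names, and the piecewise-linear interpolation between target times are assumptions made for illustration; the text states only that the pitch contour is approximated by line segments, and the units of the fields are not given here (the pitch period values suggest seconds).

```python
from dataclasses import dataclass


@dataclass
class GlottalRecord:
    t: float      # target time
    T: float      # pitch period
    ag0: float    # minimum glottal area
    agamp: float  # amplitude of the glottal area function
    q0: float     # open quotient
    qs: float     # speed quotient
    ps: float     # subglottal pressure


# Rows of Table A-1; the whitespace-separated layout is an assumption.
SAMPLE = """\
0.51    0.011   0.0  0.15  0.7  1.5  7.
0.565   0.012   0.0  0.2   0.7  1.5  7.
0.637   0.0115  0.0  0.2   0.7  1.5  7.
0.7175  0.012   0.0  0.15  0.7  1.5  8.
0.7535  0.0126  0.0  0.2   0.7  1.5  8.
0.8717  0.013   0.0  0.15  0.7  1.5  7.
0.92    0.013   0.0  0.15  0.7  1.5  2.
"""


def parse_glottal_file(text):
    """Parse one record per line; lines without exactly seven fields are skipped."""
    records = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7:
            continue
        records.append(GlottalRecord(*map(float, parts)))
    return records


def value_at(records, time, name):
    """Piecewise-linear interpolation of one field between target times (assumed)."""
    if time <= records[0].t:
        return getattr(records[0], name)
    for a, b in zip(records, records[1:]):
        if a.t <= time <= b.t:
            w = (time - a.t) / (b.t - a.t)
            return getattr(a, name) + w * (getattr(b, name) - getattr(a, name))
    return getattr(records[-1], name)


records = parse_glottal_file(SAMPLE)

if __name__ == "__main__":
    for time in (0.51, 0.60, 0.80, 0.92):
        period = value_at(records, time, "T")
        print(time, round(period, 5), round(1.0 / period, 1), "Hz")
```

A tract file could be handled in the same way with a four-field record (t, filename, N, Av), keeping the filename as a string.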

An example of the input for synthesizing the word "mining" is shown in Figure A-2.
The output is shown in Figure A-3.

Table A-1. Sample glottal file for the word "mining".

t        T        Ag0    Agamp   Q0     Qs     Ps
0.51     0.011    0.0    0.15    0.7    1.5    7.
0.565    0.012    0.0    0.2     0.7    1.5    7.
0.637    0.0115   0.0    0.2     0.7    1.5    7.
0.7175   0.012    0.0    0.15    0.7    1.5    8.
0.7535   0.0126   0.0    0.2     0.7    1.5    8.
0.8717   0.013    0.0    0.15    0.7    1.5    7.
0.92     0.013    0.0    0.15    0.7    1.5    2.

Table A-2. Sample tract file for the word "mining".

t       filename   N     Av
0.51    MM_.AF     30    0.34
0.57    MM_.AF     30    0.4
0.58    AAA.AF     30    0.4
0.66    EEE.AF     30    0.2
0.71    EEE.AF     30    0.34
0.72    NNN.AF     30    0.4
0.76    NNN.AF     30    0.4
0.77    III.AF     30    0.4
0.83    III.AF     30    0.34
0.84    NG.AF      30    0.4
0.92    NG.AF      30    0.4

Figure A-2. The input for synthesizing the word "mining".
upper: Pitch contour.
middle: Subglottal pressure.
bottom: Velopharyngeal opening.

Figure A-2. (continued)
Area functions.

Figure A-3. The output for the synthetic word "mining".
upper: Volume velocity at the glottis.
middle: Volume velocity at the mouth.
bottom: Volume velocity at the nostrils.

Figure A-3. (continued)
upper: Pressure at the glottis.
bottom: Speech waveform.